(57) ABSTRACT

Method and apparatus for combining a plurality of overlapping policy-based controllers. System also applicable to policy-based process servers. System combines controllers by combining the respective policy information. System combines a plurality of policy-based sub-controllers by combining the associated distributional information contained in the associated sub-policies. An iterative mixture mechanism with temporal persistence regulates the relative contribution of the sub-policies smoothly over time thereby allowing smooth transition of control from one control regime to another. The system provides for modular detection and resolution of conflicts that may arise as a result of combining otherwise incompatible sub-policies. Preferred embodiment performs mixture method in policy space. Another embodiment applies mixture method to value functions associated with each sub-server.

24 Claims, 24 Drawing Sheets 




10/14/2003, EAST Version: 1.04.0000 



U.S. Patent Oct. 29, 2002 Sheet 1 of 24 US 6,473,851 B1



[Figure 1A: "A Stochastic Policy"; horizontal axis: Action ID (1 to 5)]



[Figure 1B]

[Figure: "A Fuzzy Policy"]

[Figure 1F]

[Figure 1H]

[Figure 1I]

[Figure 1J]

[Figure 3]
601

Schema of Policy Specification Table: PolicyId, PolicyType
where valid PolicyTypes are {fuzzy, stochastic},
and valid PolicyIds are {1,2,...,v}

602

Schema of Policy Table: PolicyId, ActionId, ActionWeight,
where ActionId is in {1,2,...,#A},
and ActionWeight is a real-valued number taking values in [0,1] if PolicyType is "stochastic" and otherwise takes any floating point value.

603

Schema of Action Table: ActionId, ActionSpecifier
where ActionId is in {1,2,...,#A},
and ActionSpecifier is a code that specifies the action attributes or provides an index into a table that contains this information.
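
The three table schemas above can be illustrated with a minimal Python sketch. The class names, the field names in snake case, and the helper `validate_policy_row` are hypothetical and chosen only for illustration; the source specifies the columns and their value ranges, not any particular storage layer.

```python
from dataclasses import dataclass

# Policy types permitted by the Policy Specification Table (601).
VALID_POLICY_TYPES = {"fuzzy", "stochastic"}


@dataclass(frozen=True)
class PolicySpec:
    """One row of the Policy Specification Table (601)."""
    policy_id: int
    policy_type: str


@dataclass(frozen=True)
class PolicyRow:
    """One row of the Policy Table (602)."""
    policy_id: int
    action_id: int
    action_weight: float


@dataclass(frozen=True)
class ActionRow:
    """One row of the Action Table (603)."""
    action_id: int
    action_specifier: str


def validate_policy_row(row, specs, num_actions):
    """Check a Policy Table row against the constraints stated in 601-602.

    specs maps policy_id to PolicySpec; num_actions plays the role of #A.
    """
    spec = specs.get(row.policy_id)
    if spec is None or spec.policy_type not in VALID_POLICY_TYPES:
        return False
    if row.action_id not in range(1, num_actions + 1):
        return False
    if spec.policy_type == "stochastic":
        # Stochastic weights are probabilities, so they must lie in [0, 1].
        return 0.0 <= row.action_weight <= 1.0
    # Fuzzy weights may take any floating point value.
    return True
```

Under these assumptions a weight of 1.5 is rejected for a stochastic policy but accepted for a fuzzy one, which mirrors the ActionWeight rule in schema 602.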
1) Rank-order actions according to necessity in descending order (least necessary at end of list)

2) Mark all actions that conflict with any other actions in the list. 

3) While there remain actions on the list that conflict with any others on the list: 

a. remove lowest ranking marked action from list 

b. clear all marks 

c. mark all actions that conflict with any other actions in the list 

4) Replace output policy with new policy obtained by deleting all actions not in list generated by Step 3. 
Leave action weights (degree of membership) in remaining actions unchanged.
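
The four steps above can be sketched in Python as follows. The representation is an assumption made for illustration: the policy is a dict from action id to weight, `necessity` is a hypothetical dict of scores (higher means more necessary), and `conflicts` is a set of unordered pairs stored as frozensets.

```python
def resolve_conflicts(policy, necessity, conflicts):
    """Remove mutually conflicting actions, least necessary first.

    policy:    dict mapping action id -> weight
    necessity: dict mapping action id -> score (higher = more necessary)
    conflicts: set of frozenset({a, b}) pairs that must not co-occur
    """
    # Step 1: rank-order by necessity, least necessary at the end.
    remaining = sorted(policy, key=lambda a: necessity[a], reverse=True)

    def marked(actions):
        # Steps 2 and 3c: mark every action that conflicts with another
        # action still on the list. Rank order is preserved.
        return [a for a in actions
                if any(frozenset((a, b)) in conflicts
                       for b in actions if b != a)]

    marks = marked(remaining)
    while marks:
        # Step 3a: drop the lowest-ranking marked action.
        remaining.remove(marks[-1])
        # Steps 3b and 3c: clear the marks and recompute them.
        marks = marked(remaining)

    # Step 4: keep surviving actions with their weights unchanged.
    return {a: policy[a] for a in remaining}
```

Recomputing the marks after every removal matters: removing one action can clear the conflict for several others, so a single marking pass would delete more actions than the procedure requires.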
RESOLVE CONFLICTS WITH ONGOING ACTIONS

1) Rank-order actions according to necessity in descending order (least necessary at end of list)

2) Mark all actions that conflict with any ongoing actions.

3) While there remain actions on the list that conflict with any ongoing actions:

a. remove lowest ranking marked action

b. clear all marks

c. mark all actions that conflict with any ongoing actions

4) Replace output policy with new policy obtained by deleting all actions not in list generated by Step 3.

5) If Output Policy Type is Fuzzy: Leave action weights (degree of membership) in remaining actions unchanged.

6) If Output Policy Type is Stochastic: Renormalize policy distribution to be probabilistic (i.e., so that action weights sum to 1.0).
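
A minimal Python sketch of this procedure follows, under the same illustrative assumptions as before (dict-based policy, a hypothetical `necessity` score per action, conflicts as frozenset pairs). The handling of an all-zero remainder is also an assumption, since the source does not say what to do when nothing is left to renormalize.

```python
def resolve_against_ongoing(policy, policy_type, necessity, ongoing, conflicts):
    """Drop candidate actions that conflict with ongoing actions.

    policy:      dict mapping action id -> weight
    policy_type: "fuzzy" or "stochastic"
    necessity:   dict mapping action id -> score (higher = more necessary)
    ongoing:     iterable of action ids currently executing
    conflicts:   set of frozenset({a, b}) pairs that must not co-occur
    """
    ongoing = list(ongoing)
    # Step 1: rank-order by necessity, least necessary at the end.
    remaining = sorted(policy, key=lambda a: necessity[a], reverse=True)

    def marked(actions):
        # Steps 2 and 3c: mark actions that conflict with any ongoing action.
        return [a for a in actions
                if any(frozenset((a, o)) in conflicts for o in ongoing)]

    marks = marked(remaining)
    while marks:
        remaining.remove(marks[-1])   # Step 3a
        marks = marked(remaining)     # Steps 3b and 3c

    # Step 4: the new output policy holds only the surviving actions.
    result = {a: policy[a] for a in remaining}

    # Step 6: a stochastic policy must sum to 1.0 again.
    # Step 5: a fuzzy policy keeps its weights unchanged.
    if policy_type == "stochastic":
        total = sum(result.values())
        if total > 0:  # assumption: leave an all-zero remainder untouched
            result = {a: w / total for a, w in result.items()}
    return result
```

Because conflicts here are checked only against ongoing actions, which the loop never removes, each marked action is eventually deleted; the loop form is kept to stay parallel with the stated steps.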
1) Rank-order actions according to necessity in descending order (least necessary at end of list)

2) Mark all actions that conflict with actions triggered in previous time step.

3) While there remain actions on the list that conflict with any actions triggered in previous time step:

a. remove lowest ranking marked action

b. clear all marks

c. mark all actions that conflict with any actions triggered in previous time step.

4) Replace output policy with new policy obtained by deleting all actions not in list generated by Step 3.

5) If Output Policy Type is Fuzzy: Leave action weights (degree of membership) in remaining actions unchanged.

6) If Output Policy Type is Stochastic: Renormalize policy distribution to be probabilistic (i.e., so that action weights sum to 1.0).
Submit policy to output interface: given current stimulus $s_t$, for each action $a \in A$ do the following:

1. If a was omitted from the output policy due to conflict detection checks then

• If output interface format is a database table or file then flag action a as "invalid", e.g.,

a. remove the record for that action from the table containing the output policy, or

b. use a field within that action's record schema to represent the flag, or

c. set the weight for that action to a special value, such as "indeterminate," "null," or "empty."

• If output interface format is electronic fixed-width control signal format then latch the weight for action a to a special value that corresponds to that action's control signal being disabled.

2. If Step 1 is not performed for action a then set the weight for action a to $\pi(s_t, a)$ accordingly.
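
The submission step can be sketched as below. This is only an illustration: the function name, the `interface` argument, the use of `None` as the "null" weight (variant c above), and the default `disabled_signal` value are all assumptions, not part of the source.

```python
INVALID = None  # assumed stand-in for the "indeterminate"/"null"/"empty" weight


def submit_policy(all_actions, output_policy, interface="table",
                  disabled_signal=-1.0):
    """Format an output policy for a downstream interface.

    all_actions:   ordered list of every action id in A
    output_policy: dict of action id -> weight for actions that survived
                   conflict detection; omitted actions are treated as invalid
    interface:     "table" (database table or file) or "signal"
                   (fixed-width control signal)
    """
    if interface == "table":
        # Variant (c): keep one record per action and flag omitted actions
        # by setting their weight to a special value.
        return {a: output_policy.get(a, INVALID) for a in all_actions}
    if interface == "signal":
        # Latch omitted actions to the value that disables their signal.
        return [output_policy.get(a, disabled_signal) for a in all_actions]
    raise ValueError("unknown interface: %r" % (interface,))
```

Keeping a record for every action, rather than deleting rows (variant a), gives the consumer a fixed-size result in both formats, which is one reason a special value may be preferable in practice.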
FIGURE 15: a perspective view of an exemplary signal-bearing medium in accordance with one embodiment of the invention.
SYSTEM FOR COMBINING PLURALITY OF INPUT CONTROL POLICIES TO PROVIDE A COMPOSITIONAL OUTPUT CONTROL POLICY

BACKGROUND OF THE INVENTION

1. Field of Invention

The present invention generally relates to policy-based controllers and policy-based process servers.

2. Background-Discussion of Prior Art

This section puts the invention into its proper context. We provide a cursory background and define required terminology. Readers unfamiliar with stochastic control, reinforcement learning, or optimal process control may find the next several subsections helpful in defining the fundamental underlying technologies. Readers very familiar with these topics should at least skim these sections to review general terminology.

A. Scope of Applicability and Main Concepts

This invention is closely related to technologies of Stochastic Control and Reinforcement Learning. Control systems technology is rather well-developed and has numerous sub-areas. Because of this the reader may be accustomed to different terminology to refer to the concepts used here. The terminology we use is in line with definitions employed in [Kaelbling Littman and Moore 1996] and [Sutton and Barto 1998], which provide background survey information, tutorial treatment, precise definitions of technical concepts discussed here, as well as a clear explanation of the prior art.

[...] We try [...] jargon. Crucial technical definitions are formalized using mathematical notation in the sections titled "Formal Definition of Prior Art" and "Formal Definition of the Mixture of Policies Framework."

1. Separation of Policy and Execution

In the technical jargon of control theory, the mapping of a stimulus to a set of action tendencies is referred to as a "policy." Given a set of candidate actions and a stimulus, a policy is a function that recommends one or more actions in response to the given stimulus. Stochastic Control pertains to the technology of using a [...] action selection processes. FIGS. 1A and 1B illustrate examples of policies. An action selection module then uses a policy to guide its selection of the action or actions from the permissible set of candidate actions. Some control mechanisms specified in the prior art do not separate policy from execution, but here we do. The essential concepts remain whether or not the execution mechanism is inextricably intertwined with the policy data structure or separated as is the case here. The policy "recommends" actions, the action selection module "executes" one or more actions according to this recommendation. This execution mechanism can be straightforward, such as the greedy method of always selecting the highest ranked action. Or it can be more involved, such as when additional checks are made to determine whether an action will conflict with other ongoing actions before triggering it. (See the tutorial references [Kaelbling Littman and Moore 1996] and [Sutton and Barto 1998] for more discussion of how to convert policy information into action selection procedures.)

2. Controllers can Trigger "Actions" as well as "Procedures"

Although we speak about "actions" and "action selection," the controllers described in this document can also regulate procedures. Therefore, an "action selection module" as defined here can control (a) instantaneous actions, (b) ballistic (non-interruptible and non-modifiable) action sequences, but can also regulate (c) ongoing physical processes or (d) branching procedures.

Actions controlled or initiated by a policy can be

1. Momentary or instantaneous: e.g., flash a light bulb, flip a switch.

2. Continuous: e.g., gradually increment the temperature of a furnace over time.

3. Procedural: initiate a multiple step and possibly branching computer program.

Furthermore, actions can be

1. Discrete: e.g., a database containing a finite set of actions indexed by an integer record pointer. An example of this is a web-based ad server for the purpose of displaying a particular ad targeted at a website visitor.

2. Continuous: e.g., a possibly multidimensional control signal indexed by a point within a Euclidean vector space, such as [...] electronic system. An example of this is an electronic vacuum pressure regulator inside an automobile.

We refer to "actions" for simplicity but without loss of generality because an action can mean triggering a procedure, parameterizing the initial state of a procedure, or modifying state information used by an ongoing procedure.

3. Compatible with Reinforcement Learning Technologies

Although this invention does not provide new technology for learning per se, all the policy and control mechanisms described here are compatible with the general framework of [...] modularization of the data structures and mechanisms ([...] formulating policy and executing policy) reduce the computational burden of obtaining policy information.

Various statistical, computational, and programming technologies can be applied to obtain a policy. These technologies are well developed and include a wide variety of computational, statistical, and electronic methods. Methods for obtaining or refining policy include (a) explicit programming, (b) direct computation, (c) evolutionary design, (d) evolutionary programming, (e) computerized discovery over historical data stores, (f) computerized sta[...]. See [Kaelbling Littman and Moore 1996] and [Sutton and Barto 1998] for a review and additional references.

A policy can be

1. Probabilistic: actions are weighted by a probability distribution over the action database. In this case the action selection module picks one action at random drawn according to this distribution. See for instance FIG. 1A.

2. Deterministic: only a single action is recommended, see for instance FIG. 1B.

The field of Reinforcement Learning provides technologies for systematically learning, discovering, or evolving policies suitable for stochastic control. Reinforcement learning theory is a fairly mature technology. The field of Fuzzy Control modifies this functionality to allow the following:

3. Fuzzy Membership Assignment: a distribution (possibly non-probabilistic) is applied over the actions in the action database. See FIG. 1C.
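
The three kinds of policy listed above, and the separation between a policy and the action selection module that executes it, can be illustrated with a short Python sketch. The dict representation and the function names are assumptions made for illustration, and the threshold rule in `select_fuzzy` is one hypothetical way of turning membership degrees into a set of parallel actions; the source does not prescribe it.

```python
import random


def select_stochastic(policy, rng=random):
    """Probabilistic policy: draw a single action according to its weights."""
    actions = list(policy)
    weights = [policy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def select_greedy(policy):
    """Greedy execution: always pick the highest-ranked action."""
    return max(policy, key=policy.get)


def select_fuzzy(policy, threshold=0.5):
    """Fuzzy policy: trigger several actions in parallel.

    Assumed rule: every action whose degree of membership reaches the
    threshold is triggered, and its weight is passed along so it can be
    used to initialize that action's parameters.
    """
    return {a: w for a, w in policy.items() if w >= threshold}
```

For example, `select_greedy({1: 0.1, 2: 0.7, 3: 0.2})` picks action 2, while `select_fuzzy({1: 0.9, 2: 0.2, 3: 0.6})` triggers actions 1 and 3 together. In each case the policy only recommends; the selection function decides what is executed.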
Given a fuzzy policy the action selection module simultaneously applies one or more of the actions. Therefore, fuzzy control as defined here allows multiple actions to be triggered in parallel. Moreover the action selection mechanism may also utilize the weighting specified by the distribution to initialize parameters of each action. See for instance FIG. 1C.

The definition of Fuzzy Policy we use here may be inconsistent with definitions used in prior art, and is not included in the tutorial treatment explained in [Kaelbling Littman and Moore 1996] and [Sutton and Barto 1998], which concentrate exclusively on stochastic control. However, Fuzzy Policy as defined here is related to "fuzzy sets" in that they both specify "degree of membership" rather than "probability." Fuzzy Policy as defined here also allows more than one action to be selected in parallel by the action selection mechanism, whereas a stochastic policy expects only a single action to be selected at one moment in time.

4. A Policy is a Multi-valued "Recommendation," a Value Function is a "Ranking"

Closely related to the notion of "policy" is the "value function." Rather than a probabilistic distribution over the action database, a value function assigns a numerical weight to each action. A policy formulation mechanism then converts this value function into a policy. What we define as a "fuzzy policy" suffices for representing value functions. Therefore, we can manipulate value functions by treating them as Fuzzy Policies.

Technology for converting a single value function into a policy is standard fare in prior art cited here. However, prior art does not address the combination of multiple value functions (see FIG. 1G) or the simultaneous collapse of multiple value functions into a single stochastic policy (see FIG. 1I), or the convergence of multiple stochastic policies in order to obtain a new value function (see FIG. 1H).

5. General Applicability and Specific Practical Advantages

This invention is generally compatible with the technologies of reinforcement learning, stochastic control, and fuzzy control. Therefore it has broad scope because of the broad scope of these technologies. These wide-ranging technologies can be used to leverage this invention in a wide variety of ways. Despite the wide-ranging theoretical applicability of these technologies they have limits in certain practical applications. The next section homes in on those limitations that are relevant to this invention.

B. Brief Overview of Prior Art

For comprehensive survey or tutorial treatment see [Kaelbling Littman and Moore 1996] or [Sutton and Barto 1998]. We proceed directly to discussing the currently most advanced technology upon which this invention serves to improve.

One of the key constraints upon efficient execution of stochastic control is the computational complexity of the policy information. For background see especially the discussion on compact mappings in the tutorial references [Kaelbling Littman and Moore 1996] and [Sutton and Barto 1998]. However, compact mappings do not completely alleviate the computational cost of learning and executing complex policies. Although a compact map does provide size and speed advantages over a method relying upon less compact data structures, even this approach will rapidly be overwhelmed by the complexity of common practical tasks. Additional efficiencies can be gained by breaking down a policy into modular sub-components. The "gated policy" approach splits the policy into a set of sub-policies and uses a gating mechanism to select from among the sub-policies. This approach has numerous variations and encapsulates numerous complexities that are not exhaustively described here; however, a simple high-level illustration of the essential features of the general approach relevant to this invention is depicted in FIG. 1D.

Given a stimulus s, this "gated policy" mechanism selects the sub-policy appropriate for the stimulus at hand, passing that policy through to an action selection module, which then executes that sub-policy upon the given stimulus. "Stimulus" as defined here is quite general, encompassing external sensory stimuli as well as state space accessed within internal memory.

The gated policy approach can make executing or learning stochastic policy information more efficient. It streamlines the acquisition of policy, say, by computerized discovery, exhaustive search, reinforcement learning, or iterative evolution. This is because sub-policies may be more easily obtained individually than can a single monolithic policy. It also streamlines the subsequent refinement of a complex control policy by allowing "learning" to occur hierarchically at multiple levels of description. (Note that while FIG. 1D depicts a single level of sub-policies the method can be applied to each of the sub-policies to generate an additional level in the hierarchy, and this decomposition can be applied repeatedly to obtain a hierarchy with multiple levels.) The modular policy approach also streamlines the execution of policy, because multiple simpler sub-policies can replace a complex monolithic policy. It also allows policies stored in different data structures to be combined (e.g., compact maps, database tables, decision trees, procedural code). Therefore, this general approach of "divide-and-conquer" has numerous valuable benefits. Methods that can make efficient use of modular policies have several practical advantages over methods that wield a monolithic policy.

C. Formal Definition of Prior Art

Here we formalize the concepts introduced above.

Current implementations of process controllers typically employ a single method for defining policy (e.g., rules-based, or statistical, but not both). Current technologies based upon a purely rules-based approach can require a large number of rules that take up much space and are costly to evaluate in real-time. Current applications of machine learning and datamining embedded in commercially available process controllers are good for operating on some types of data but limited upon others. (E.g., a web-based personalization server based upon collaborative filtering is good for inferring preference based upon on-site browsing behavior but may be much less useful for deducing preference from an explicit profile provided via questionnaire.) Also, machine learning methods are great for learning from example, but are also largely limited to learning from example; users often need more direct control of the process controller, e.g., by encoding certain rules of behavior explicitly. Therefore, different tasks call for different control strategies, and different control strategies call for different data structures storing policy information, and different strategies for obtaining or refining that policy information.

Even though there are numerous types of data structures for encoding policy information these types can be unified within a single general framework using concepts from reinforcement learning. The reinforcement learning terminology we employ here equates "agent" with "process controller" or "process server" so we will refer to an "agent" henceforth instead of "controller." The concept of "agent" is also more general than the term "controller," and is more appropriate for the computational server applications being emphasized here.
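
As a concrete illustration of the gated policy mechanism described in the overview of prior art, the following Python sketch passes a stimulus through a gating function that enables exactly one sub-policy. The function names, the dict-of-callables representation, and the example gate `sign_gate` are hypothetical. The second function writes the same switch as an indicator-weighted sum over all sub-policies, which is the form the formal definition of the gated approach uses.

```python
def gated_policy(stimulus, gate, sub_policies):
    """Crisp gating: return the distribution of the one selected sub-policy.

    gate:         callable mapping a stimulus to a sub-policy index
    sub_policies: dict of index -> callable(stimulus) -> {action: probability}
    """
    index = gate(stimulus)
    return dict(sub_policies[index](stimulus))


def gated_policy_as_sum(stimulus, gate, sub_policies):
    """The same switch written as a sum of indicator-weighted sub-policies."""
    chosen = gate(stimulus)
    mixed = {}
    for index, sub_policy in sub_policies.items():
        # Indicator: 1 for the selected sub-policy, 0 for every other one.
        indicator = 1.0 if index == chosen else 0.0
        for action, prob in sub_policy(stimulus).items():
            mixed[action] = mixed.get(action, 0.0) + indicator * prob
    return mixed


def sign_gate(stimulus):
    """Hypothetical gate: route by the sign of the first stimulus component."""
    return 1 if stimulus[0] >= 0 else 2
```

Although the second form sums over every sub-policy, only one term is ever non-zero, so the sub-policies never act on a stimulus in concert. Replacing the 0/1 indicator with graded weights is what would turn this switch into a mixture.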

Consider an "agent" located in an environment. The agent's "environmental state" or "stimulus" is a (possibly highly processed) version of the environment "external" to the agent. Therefore, whereas (in typical usage of the term) a "controller" reacts to sensory information directly or subsequent to some numerical processing, an "agent" can react to highly processed information. The agent's external sensors and internal state memory define this stimulus state, which we model as a d-dimensioned real-valued vector space:

$$S \subseteq \mathbb{R}^d$$

This state could, for example, be the onsite behavior of a website shopper, such as shopping basket contents or page view sequence. Or it could be based upon statistics inferred from historical memory of past purchases by that shopper. In this example the candidate actions each could select a single product recommendation from among a large set of available products, or sort a list of product recommendations in a particular way, or display a link to a particular page. Alternatively, this state could be the stimulus experienced by a robotic toy doll, and the candidate actions each select an appropriate facial expression and body pose in reaction to that stimulus.

For simplicity, we will take the set of available actions to be a discrete set of r actions for some integer r:

$$A = \{a_1, a_2, \ldots, a_r\}$$

Each action $a \in A$ is a pointer into a database of r procedural routines. $A(s) \subseteq A$ gives the actions available while in state $s \in S$.

Continuous action spaces are useful for some applications, but are not necessary to illustrate the main concepts being described here. For clarity we introduce the main concepts using discrete action spaces. It is straightforward to extend these concepts to continuous action spaces and the mechanisms for doing so are rather obvious to the informed technologist by drawing upon references such as [Kaelbling Littman and Moore 1996] or [Sutton and Barto 1998] for guidance.

Consider a sequence of stimuli $s_1, s_2, s_3, \ldots$ For each $t = 1, 2, \ldots$, a "policy" $\pi$ applies a linear order to the set of actions available for responding to stimulus $s_t$. Above we briefly mentioned the distinction between a value function and a policy; the tutorial texts referenced above describe this distinction very clearly. Ultimately the value function must be converted to a policy when applied to action selection and so controllers based upon the modular policy approach commonly apply the modularity within "policy space" rather than in "value function space." However, one embodiment of this invention (described in the specification and claims provided below) is suitable for combining policy information in "value function space." For clarity in explanation and simpler notation we confine our description to "policy space." Upon recognizing the drawbacks of prior art and the specific advantages of this invention, a reasonably capable expert can easily extend this method to apply to value functions without requiring any insights that are not obvious from reading this document or from the prior art cited here.

Intuitively, a policy can be said to model a set of "behaviors" or "action tendencies." A policy can be deterministic (say, choose the highest ranked action as indicated by a value function) or stochastic (i.e., select one of the actions probabilistically). A stochastic policy implements the mapping:

$$\pi : S \times A \to [0, 1]$$

for state $s_t \in S$ at time t choosing action $a_t$ with probability

$$\pi(s_t, a_t)$$

A static stochastic policy is one where no adaptation occurs over time such that $\pi_t(s, a) = \pi(s, a)$, $t = 1, 2, \ldots$ First, we consider a policy that is not modified by learning over previous actions during the lifetime of the agent. For stimulus state $s \in S$ at time t and action $a \in A$, static stochastic policy $\pi_t$ sets the probability with which action a is chosen to

$$\pi_t(s, a) = \Pr(a_t = a \mid s_t = s)$$

Note that stochastic control subsumes deterministic control; therefore, this type of policy can implement deterministic behaviors (e.g., via simple rules or procedural script). A number of ways exist to compose an action selection rule from a policy which we omit here for brevity (case studies are provided in [Sutton and Barto, 1998] and [Kaelbling, Littman, and Moore, 1996]). Additional details for converting policy to action selection and for learning or evolving policy are omitted because the essentials of this patent are focused mainly within policy formulation and combination and these details are easily obtained from the references to prior art cited here. Intuitively, a policy ranks the list of candidate actions from which action selection thereby selects a single action function according to that ranking.

Fuzzy controllers as defined here can trigger multiple actions in parallel. Also, because a "fuzzy policy" as defined here is a non-probabilistic distribution, a fuzzy policy formally subsumes stochastic policy. But we describe the prior art involving the triggering of single actions under the stochastic framework for several reasons. (a) It is often quite straightforward to reduce the simultaneous triggering of multiple actions into the framework of single actions. (b) Stochastic control is more familiar to experts and practitioners of intelligent control technologies. (c) It is easier to describe the general mechanism by considering the special case of stochastic control than if we attempt to retain full generality throughout the entire discussion. Upon recognizing the drawbacks of prior art and the specific advantages of this invention, a reasonably capable expert can easily extend this method to apply to fuzzy policy without requiring any insights that are not obvious from reading this document or from the prior art cited here.

The notation used to denote policy thus far does not admit real-time learning. Reinforcement learning allows a policy to depend upon (i.e., be conditioned on) previous events experienced by the agent. Therefore, we have a dynamic stochastic policy $\pi^{c,k}_t$ that for state $s \in S$ chooses action a with probability

$$\pi^{c,k}_t(s, a, a^{t-k}, s^{t-k})$$

where now policy execution over state space (the current action ranking) is a function of the k previous actions and stimuli:

$$\pi^{c,k}_t(s, a, a^{t-k}, s^{t-k}) = f(s, a, a^{t-k}, s^{t-k})$$

where $a^{t-k}$ and $s^{t-k}$ are the historical sequences of the k previous actions and states respectively, such that $a^{t-k} = (a_{t-1}, a_{t-2}, \ldots, a_{t-k})$ and $s^{t-k} = (s_{t-1}, s_{t-2}, \ldots, s_{t-k})$. For simplicity in what follows we'll let k=t (indefinite memory), and denote $a^t = a^{t-k}$ and $s^t = s^{t-k}$ for k=t, and refer to $\pi^c_t$ instead of $\pi^{c,k}_t$. Where confusion will not arise we may abuse notation slightly and use $\pi^c_t(s, a)$ rather than $\pi^c_t(s, a, a^t, s^t)$, so long as it is clear that the computation of $\pi^c_t$ depends upon previous states and actions, whereas it does not for $\pi$ and $\pi_t$. Reinforcement learning and supervised learning theories each provide several mechanisms entirely suitable for computing f (and thereby, $\pi^c_t$). For a survey of these mechanisms see [Sutton and Barto, 1998] and [Kaelbling, Littman, and Moore, 1996].

Different ways of encoding policy are useful for different purposes. A static policy is useful for encoding simple rules (say, describing expert intuition). A dynamic policy acquired in real-time via statistical learning is good for tracking user behavior via passive observation. In theory, we can easily combine these into a single policy. But in practice, there are good reasons to keep each type of policy separate. One reason is computational efficiency. Simple rules can be efficiently coded as a look-up table. On the other hand, a functional form that is efficient for a simple policy $\pi_t$ (say, requiring only a small table of rules) will in general be inefficient for a complex policy $\pi^c_t$ (for which a compact map will be necessary in general to reduce space requirements). Another reason is modularity. Functional cohesiveness applied to policy improves ease of maintenance.



15 



equation involves a summation, it is essentially describes a 
"switch** that enables one and only one sub-policy. The 
indicator function IXg(s)) serves as the "switch." The cor- 
responding action selection drawn accordingly, e.g., (say) by 
random draw from the actions database according to the 
policy action probability specified by the selected sub- 
policy. This invention improves upon the gated approach by 
replacing the indicator function IXgfs)) with a weighting 
function. 

To summarize, gated policy methods exemplify the prior 
art that is improved upon by this invention. Closely related 
methods are also referred to using terms such as "hierarchi- 
cal learning," "layered control,'* and "modular policies." 
Gated policy methods can compartmentalize learning and 
response based upon the input state, and can also allow 
learning to occur at different levels of analysis. In principle, 
this could be achieved equally well by a monolithic (i.e., 
non-modular) system, albeit at possibly much more compu- 
tation required in practical application. I.e., this type of 
modular policy reduces to a single policy, albeit one 
obtained by piecemeal composition of sub-policies over 
state space. Said again in different terms, the sub-policies do 
not overlap in input space. This constraint is enforced upon 
all gated policy methods, either explicitly (in that policies 



Conditioned policy obtained by reinforcement learning 25 respond to mutually distinct portions of the input space) or 



can be improved further. E.g., it docs not yet permit the 
explicit modeling of particular types of conditioned response 
that localize certain types of conditioning to particular 
regions of stimulus space. Both of these issues benefit from 
a straightforward extension known as a gated policy, as 30 
shown in FIG ID. For a survey of such methods see 
[Kaelbling, Littman, and Moore, 1996]. A gating function 
decides which policy should be switched through and actu- 
ally executed based on the stimulus state. 



implicitly (because of the effects of the gating mechanism 
policies effectively respond to mutually distinct portions of 
the input space). 
D. Drawbacks of Prior Art 

The gated policy approach possesses inherent constraints 
that limit its use. The gated policy approach does not allow 
multiple overlapping policies to be combined in order to act 
upon the stimulus in concert The gated policy approach 
instead selects a single sub-policy by a crisp selection. There 



The "gated behaviors" approach includes a wide variety 35 exist practical applications for which overlapping sub- 

of methods, from single-level masterslave, to hierarchical- policies are very useful. Another drawback of the gated 

level "feudal Q-Iearning" [Dayan and Hinton, 1993]. In policy approach is that it can only select from among 

Maes and Brooks [1990] the policies were fixed and the available policies, it cannot combine them to obtain a 

gating function was learned from reioforcement. compositional policy that is better suited than any of the 

[Mahadevan and Connell 1991] fixed the gating function 40 available policies are individually. 



and trained the policies by reinforcement. [Lin 1993], 
[Dorigo and Colombelti 1994], and [Dorigo 1995] trained 
the policies first and then trained the gating function. Diet- 
terich and Flann explored hierarchical learning of policy 
[Dietterich 1997], [Dietterich and Flann 1997]. Whereas 45 
these prior art references concentrate on learning the modu- 
lar sub-policy information, this invention provides a means 
for combining it in a better way, while still allowing still 
these methods for learning the sub-policy information to be 
applicable. 

Now we formalize the gated policy approach. This will be
useful for clearly defining the novel features of this
invention when we formalize its essential features in the
specification of the main embodiment below. Let $\pi^{g}_{t}$ be a gated
policy over a single level of $v$ sub-policies
$(\pi^{g,1}_{t}, \pi^{g,2}_{t}, \ldots, \pi^{g,v}_{t})$,
with gating function $g : S \to \{1, 2, \ldots, v\}$, which
chooses the policy appropriate for the given stimulus state.
As with the policies previously defined above, this policy
sets the probabilities associated with action tendencies:

If $\pi^{g}_{t}$ is to be obtained by a gated selection from a
(nonhierarchical) set of sub-policies, then

where for any integer $a$, $I_{i}(a)$ is an indicator function that is
equal to 1 when $a = i$, and 0 otherwise. Note that although this



This invention allows multiple overlapping policies to be 
combined, and this is the central innovation of this patent. 
Rather than use a crisp selection, this invention employs a 
"soft" mixture of policies. 

Another drawback of the prior art is that the gating 
mechanism cannot smoothly transition from one policy to 
another. The switching mechanism is crisp. If the mecha- 
nism switches from one policy to another that is markedly 
different, the resulting change in the behavior will in general 
be markedly different as well. There are many applications
where it is highly desirable to switch from one control 
regime to another in a smooth fashion. 

This invention allows a controller to effect a smooth 
transition from one policy to another over time. 
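
The contrast between a crisp gate and a "soft" mixture can be sketched in a few lines. This is a minimal illustration only, not the mechanism of the preferred embodiment: the function names, the example probabilities, and the choice of a normalized weighted average as the mixture are all assumptions made for the example.

```python
def gated_policy(sub_policies, gate_index):
    """Crisp selection: return exactly one sub-policy, unchanged."""
    return list(sub_policies[gate_index])


def mixed_policy(sub_policies, weights):
    """Soft mixture (assumed form): normalized weighted average."""
    n_actions = len(sub_policies[0])
    total = sum(weights)
    return [
        sum(w * p[a] for w, p in zip(weights, sub_policies)) / total
        for a in range(n_actions)
    ]


# Two hypothetical stochastic sub-policies over the same three actions.
policy_a = [0.8, 0.1, 0.1]
policy_b = [0.1, 0.1, 0.8]

# The gate must commit to one sub-policy; the mixture uses both.
crisp = gated_policy([policy_a, policy_b], gate_index=0)
soft = mixed_policy([policy_a, policy_b], weights=[0.5, 0.5])
```

With equal weights the mixture spreads probability over the actions favored by either sub-policy, an output the gate cannot produce from these two inputs.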
E. Example Application Illustrating Drawbacks of Prior Art 
Here is a description of a practical application intended to 
highlight specific drawbacks of the prior art. 

An electronic commerce website currently utilizes several 
servers. Each server controls how resources are to be pre- 
sented to the online shopper. Resources can include product 
descriptions, suggested product recommendations, or prod- 
uct pricing information. Each server wields a policy that 
dictates the probability of presentation over the same set of 
resources. An executive procedure uses this policy to guide 
how these resources are displayed. But each server uses a
somewhat different type of information to formulate its 
policy. Several such servers are required because each one is
especially well-suited for handling particular types of
information. One server observes on-site behavior (e.g., pages
viewed, browsing behavior). Another server is aware of the
user's past purchase transaction history. Another server is
able to make recommendations based upon an explicit user
profile. Each server is a back-end process controller capable
of controlling various front-end processes, such as
displaying ads, selecting the presentation of content, or making
product recommendations. Conceptually there is really just
a single source of information: the shopper's behavior.
In the space defined by shopper behavior, the input spaces of
these servers "overlap." But because different data structures
are used to record shopper behavior, each server seems to
operate on a different type of information. Therefore, at the
most important level, that being to serve the shopper,
these servers are wielding overlapping policies. (This
example is kept simple for clarity. However, it can be
modified slightly to illustrate the practical reality that such
servers will often overlap much more explicitly. For
example, a shopper answering a questionnaire can result in
new information being shunted to both "on-site browsing
behavior" as well as "user questionnaire" data structures.)

To reiterate, this example has three servers, each one
responding to a different type of information source:

1. on-site browsing behavior

2. explicit user profile or questionnaire

3. past purchase history

In this example, all three servers are necessary because no
single server can do the entire job effectively. How can the
operation of these servers be seamlessly integrated in order
to leverage the best attributes of each one?

Suppose only one type of information is available for the
visitor (say, there is on-site behavior, but neither explicit
user profile nor past purchase history). In this case it is easy
to solve the problem at hand: simply select the server that
responds to on-site behavior. However, if two types of
information are available (say, on-site behavior, and explicit
user profile) then the situation is made more complex. Given
the prior art the options become:

1. select one server or the other

2. obtain a new server that can utilize both sources of
information

An additional option would be desirable. If the webmaster
could combine the two existing servers together to utilize
them in concert then the task would be handled more
effectively. Conceptually this reduces to combining two
(possibly overlapping) process control mechanisms.

One benefit of a seamless combination of the two existing
servers would be to smoothly transition from one server to
another. A first-time shopper will quickly generate on-site
browsing behavior but won't have past purchase history and
may not wish to fill out an explicit user profile. This makes
the first server appropriate, and the other two servers
completely useless. However, once the shopper generates some
purchases, the third server becomes useful. But rather than
simply switching over to the third server in a radical fashion
as soon as past purchase history becomes available, this
invention provides a means to migrate smoothly from one
server to the other. The gated policy mechanism is incapable
of performing this smooth transition.

Furthermore, a policy obtained by combining the three
servers can make best use of each server, using them in
concert rather than relying on only one or the other. In some
cases, the "on-site browsing behavior" server will provide
the best information. In others, the "explicit user profile"
will be most effective. But in yet others, no one server will
be most effective; rather, a combination of their policies will
yield a recommendation that is better than either one
individually. While the gated policy mechanism is highly
capable of making best use of its individual sub-policies, it
is incapable of mixing multiple policies together.

3. OBJECTS AND ADVANTAGES

A. Overview of Novel Features of the Invention

This invention allows a process [...] to utilize
overlapping policies. See FIG. 2 for a [...] overview of the
[...] mechanism. Overlapping policies occur when
multiple [...] can respond effectively to the same stimulus while
mapping to the same or different policy space, or
map to the same [...] while responding to stimuli
[...] but which occur simultaneously, such
as controllers that react to different sources of
information.

There is good reason for using overlapping policy. It
allows a process controller to wield multiple utilities.
Different utilities can be used under different circumstances
and the process controller can then wield a mixture of
[...]. Intuitively the process controller is able to
smoothly apply a multitude of motivational tendencies upon
action selection. An immediate consequence is that the
process controller can combine controllers that operate on
different sources of information. As pointed out by [Sutton
1992] and [Brafman, Tennenholtz, 1996], rational agents are
either (a) maximizers of expected utility or (b)
reinforcement learners. Process server tasks (e.g., website
personalization) naturally admit multiple "utilities"
(respectively, "types of preferences"). These utilities
correspond to having multiple objective criteria to be
optimized by the controller (respectively, multiple mental states
of the user, such as attitude, mood, objective, or task; or
multiple resources being quantified by the server, such as dollars
spent, units of product sold, or number of page views browsed).
Or they can (say) correspond to different ways of measuring
a single criterion (e.g., "user preference" can be measured in
multiple ways, e.g., by first-person subjective opinion via
questionnaire, passive observation of actual tendencies, or
by comparison to other similar people via collaborative
filtering).

The canonical gated policy approach defined above is
lacking in three ways:

(1) It has no explicit representation of multiple sources of
overlapping policy information.

(2) It has no capacity for smoothly integrating multiple
policies.

(3) It has no means for smoothly shifting control from one
policy to another.

These limitations are resolved by this invention.

This extension extends modular stochastic control to
allow simultaneous application of more than one policy to
any particular stimulus (i.e., "overlapping policies"). This
exact framework is novel; however, it is similar in spirit and
analogous in approach to the Mixtures of Controllers
approach [Cacciatore and Nowlan 1994], which is an
extension of the well-known Mixture of Experts approach
[Nowlan 1990], [Jacobs et al 1991]. One embodiment of the
mixture mechanism is a recurrent mechanism analogous to
the mixture mechanism used in the mixture of controllers
method, but with additional features that allow it to apply to
a mixture of policies. These features handle additional
complexities that arise when combining policy information
that are not an issue when combining either (a) single control
signals or (b) single recommendations.

The Mixture of Controllers approach combines the
control signals produced by multiple controllers that regulate
the same control element. Each sub-controller submits a
single control signal to the mixture mechanism, which
combines these into a single control signal that is passed on
to the controlled element. In that approach the combination
is done on each individual control signal (which in the
terminology adopted here, corresponds to the control of an
individual action), whereas this invention combines entire
policies before the control signal (or alternatively,
recommendation) is generated.

Recall that a policy corresponds to an entire set of actions. 
A mixture of policies is more useful for certain practical 
applications because it is directly applicable for stochastic 
selection from a database of discrete actions instead of 
regulating a continuous control signal. For example, this 
invention is more directly applicable to website personal- 
ization tasks than is the Mixture of Controllers approach. 
Also, this invention separates "policy" from its "execution," 
whereas the Mixture of Controllers approach does not. 

From computer science in general and operating systems
in particular it is well understood that this basic
encapsulation principle has many advantages, analogous to the way
the U.S. government separates the formulation of policy from its
execution by separating the legislative branch from the
executive branch. In addition, this invention provides an
additional mechanism for encapsulating "conflict
detection," analogous to the judicial branch of the U.S.
government. This conflict detection mechanism
preemptively detects when a policy will generate conflicts during
execution, and also resolves those conflicts.

The Mixture of Experts approach is a prior art that 
effectively combines multiple policies; however, the Mix- 
ture of Experts approach operates in "recommendation 
space." This broad class of methods includes (a) voting 
mechanisms, and (b) weighted averaging mechanisms, 
where several "experts" make a recommendation, and the 
several recommendations are consolidated (by voting or by 
weighted average, respectively). This invention differs in 
that the consolidation of expert "opinions" occurs in policy
space rather than in the recommendation space. 

The ability to manipulate and combine fuzzy policies has 
additional advantages in that it allows multiple value func- 
tions to be manipulated and combined. Technology for 
converting a single value function into a policy is standard 
fare in prior art cited here. However, prior art does not 
address the combination of multiple value functions (see 
FIG. 1G) or the simultaneous collapse of multiple value
functions into a single stochastic policy (see FIG. 1I), or the
convergence of multiple stochastic policies in order to
obtain a new value function (see FIG. 1H).

The mixing function also has a temporal component for 
regulating the speed of transition of policy over time. See 
FIG. 1J.

Although we describe the main embodiment of this inven- 
tion with respect to computer-based server applications 
involving multiple process servers wielding discrete policies,
another embodiment of the invention applies to combining 
multiple continuous policies such as those found in some 
electronic controllers. 
B. Practical Advantages of the Invention 

Here we highlight the practical benefits of the novel 
features. Although the illustrative examples described here 
focus on computer-based database server applications, this 
method has applicability in process control in general 
including electronic process controllers. 
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Combining Policies in "Policy Space" 
Combining multiple policies in "policy space" rather than
in "recommendation space" delivers additional flexibility
over the prior art mentioned above. For example, when
mixing a probabilistic policy with a deterministic policy
(having all probability concentrated on a single action), the
mixture mechanism can let the deterministic policy always
dominate the probabilistic policy (see FIG. 1E). In some
applications this is the preferred result. This reduces to a
crisp selection of the deterministic policy and can be
performed adequately by the prior art cited here. The Mixture
of Policies approach allows this effect, but it also allows the
alternative option of letting the probabilistic policy "soften"
the deterministic policy (see FIG. 1F). There are
applications for which this is the preferred result. The prior art cited
here does not allow this result.
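
The two outcomes described above ("dominate" and "soften") can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names and example policies are hypothetical, and the simple average used for softening follows the illustrative example given for FIG. 1F rather than any required mixing rule.

```python
def is_deterministic(policy, tol=1e-12):
    """True when all probability mass sits on a single action."""
    return max(policy) >= 1.0 - tol


def mix_soften(policy_a, policy_b, weight_a=0.5):
    """Blend two policies; the default is a simple average."""
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(policy_a, policy_b)]


def mix_dominate(policy_a, policy_b):
    """Let a deterministic policy override the other one."""
    if is_deterministic(policy_a):
        return list(policy_a)
    if is_deterministic(policy_b):
        return list(policy_b)
    # Neither input is deterministic: fall back to blending.
    return mix_soften(policy_a, policy_b)


# Hypothetical five-action policies.
deterministic = [0.0, 0.0, 1.0, 0.0, 0.0]
stochastic = [0.2, 0.2, 0.2, 0.2, 0.2]

dominated = mix_dominate(deterministic, stochastic)
softened = mix_soften(deterministic, stochastic)
```

In the dominated result the third action keeps all of the probability, which is equivalent to a crisp selection. In the softened result the third action remains the most likely, but every other action retains some chance of being selected.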

Easier to Incorporate Conflict Detectors
Combining multiple policies also allows an additional
level of separation of policy and execution that is extremely
advantageous when combining multiple process servers.
FIG. 1G illustrates the combination of two fuzzy policies.
Note that as defined here a fuzzy policy can "recommend"
more than one action be triggered simultaneously. An agent
that formulates a stochastic policy assumes that the
executive will select only a single action. Therefore, conflicting
actions can be recommended because the conflict is resolved
by selecting only a single action. On the other hand, an agent
that recommends a fuzzy policy (as defined here) expects
more than one action to be selected (in general). Therefore,
any mixture of multiple fuzzy policies must perform an
additional check to ensure that no conflicts will arise when
triggering multiple actions. This functionality is the
responsibility of the mixture mechanism referred to here as the
Mixing Function.

The result is a separation of "conflict detection and
resolution" from policy formulation and policy execution.
This adds another useful level of modularity to policy-based
control.
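
A conflict check of this kind might look like the sketch below. Everything specific in it is an assumption made for illustration: the action names, the list of conflicting pairs, the rule that an action is triggered when its degree of membership reaches a threshold, and the rule that the weaker of two conflicting actions is suppressed.

```python
def mix_fuzzy(policy_a, policy_b):
    """Average two fuzzy policies (degree of membership per action)."""
    actions = set(policy_a) | set(policy_b)
    return {a: 0.5 * (policy_a.get(a, 0.0) + policy_b.get(a, 0.0))
            for a in actions}


def resolve_conflicts(policy, conflicts, threshold=0.5):
    """Where two conflicting actions would both fire, keep the stronger."""
    resolved = dict(policy)
    for first, second in conflicts:
        both_fire = (resolved.get(first, 0.0) >= threshold
                     and resolved.get(second, 0.0) >= threshold)
        if both_fire:
            weaker = first if resolved[first] < resolved[second] else second
            resolved[weaker] = 0.0
    return resolved


# Hypothetical fuzzy policies from two sub-servers.
server_1 = {"show_ad": 1.0, "hide_ad": 0.2, "show_banner": 0.6}
server_2 = {"show_ad": 0.3, "hide_ad": 1.0, "show_banner": 0.8}

mixed = mix_fuzzy(server_1, server_2)
safe = resolve_conflicts(mixed, conflicts=[("show_ad", "hide_ad")])
```

Because the check runs on the mixed policy, neither sub-server nor the executive needs to know which actions are incompatible.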

Combining Policies in Value-Function Space 
A website content server may call upon multiple
sub-servers that each recommend content for display. One way
to combine these recommendations is to simply combine the
policy information provided by each sub-server using the
technique described above, which combines multiple
policies in policy space. However, policy space is not always
the best space in which to combine policies. For instance,
consider a website that is a portal which "aggregates"
content from many other sources. Those sources can be
comprised of search engines, or of content servers located at
other websites. A "children-friendly" version of the same
content is desired that imposes a zero value upon
pornographic content. In this case it is required that the probability
of displaying pornographic content is not just negligible; it
must be exactly zero. Revaluing all pornographic content to
zero value can perform this function. Although prior art such
as simple filtering mechanisms can perform this same
function, this invention allows filtering mechanisms to be
seamlessly incorporated with other process controllers, to be
extended to allow "softer" forms of filtering, and to be
switched on or off at will. Therefore, while the main
practical advantage of this invention is its ability to combine
policy-based servers in policy space, there are practical
applications in which the combination is best performed in
value function space; one embodiment of this invention
performs the latter task.
Therefore, because fuzzy policy can be used to represent
value functions, the ability to manipulate and combine fuzzy
policies has practical advantages for manipulating value
functions.
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It allows multiple value functions to be combined and 
then handed off to an action selection mechanism (such 
as a process server) that requires its recommendations 
be provided as a single value-function (see FIG. 1G) 

It allows multiple value functions to be manipulated and
combined in order to synthesize a single coherent
policy that satisfies these multiple value-functions
simultaneously to some degree (see FIG. 1I).

It allows multiple stochastic policies to be mapped back
into value function space (see FIG. 1H) where they can
be recombined more easily, more intuitively, or with
better quality control (e.g., more safely with respect to
ensuring that undesirable content will not be
displayed).

Technology for converting a single value function into a
policy is standard fare in prior art cited here. However, prior
art does not address the combination of multiple value
functions (see FIG. 1G) or the simultaneous collapse of
multiple value functions into a single stochastic policy (see
FIG. 1I), or the convergence of multiple stochastic policies
in order to obtain a new value function (see FIG. 1H).
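
The filtering example can be sketched in value function space as follows. This is a minimal sketch under assumptions: the content categories, the value numbers, the equal weights, and the use of proportional normalization to turn values into selection probabilities are all hypothetical choices, not taken from the specification.

```python
def combine_values(value_functions, weights):
    """Weighted sum of per-action values from several sub-servers."""
    actions = value_functions[0].keys()
    return {a: sum(w * v[a] for w, v in zip(weights, value_functions))
            for a in actions}


def apply_filter(values, blocked):
    """Revalue blocked actions to exactly zero."""
    return {a: (0.0 if a in blocked else v) for a, v in values.items()}


def to_stochastic_policy(values):
    """Normalize non-negative values into selection probabilities."""
    total = sum(values.values())
    if total <= 0.0:
        raise ValueError("no action has positive value")
    return {a: v / total for a, v in values.items()}


# Hypothetical value functions from two aggregated content sources.
source_1 = {"news": 2.0, "games": 1.0, "adult": 3.0}
source_2 = {"news": 1.0, "games": 3.0, "adult": 1.0}

combined = combine_values([source_1, source_2], weights=[0.5, 0.5])
filtered = apply_filter(combined, blocked={"adult"})
policy = to_stochastic_policy(filtered)
```

Proportional normalization is used here because it maps a value of zero to a probability of exactly zero. A conversion such as a softmax would assign the blocked content a small positive probability, which the example above rules out.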

Smooth Transition of Policy Over Time 

The policy mixture mechanism has a temporal component 
for enforcing smooth transition of policies over time. A 
website server controlling a graphical interface needs to 
enforce continuity in order to avoid confusing the user. 
Discontinuity is a definite disadvantage of the prior art for 
combining multiple process servers. This invention provides 
the means to ensure that transition from one policy to 
another is performed seamlessly and smoothly at a rate that 
can be precisely controlled. FIG. 1J provides a simple 
example illustrating the essential elements of this transition 
over time. Although the sub-policies which input to the 
system remain unchanged over time, the mixing function 
adjusts the relative contribution of each policy to achieve a 
smooth transition from one policy to the other. Of course, 
this illustration is a rudimentary depiction; the time units, 
time scale, and number and nature of policies encountered in 
practical application would differ greatly in general. 
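
A transition of this kind can be sketched with a mixing weight that carries over from one time step to the next. The update rule below (moving a fixed fraction of the way toward a target weighting at each step) is an assumption chosen for illustration, as are the names, the example policies, and the rate; the specification does not prescribe this particular rule.

```python
def step_weights(current, target, rate):
    """Move mixture weights a fraction `rate` of the way toward `target`."""
    return [c + rate * (t - c) for c, t in zip(current, target)]


def mix(policies, weights):
    """Normalized weighted average of the input policies."""
    total = sum(weights)
    n_actions = len(policies[0])
    return [sum(w * p[a] for w, p in zip(weights, policies)) / total
            for a in range(n_actions)]


# The sub-policies themselves never change; only the weights do.
policy_a = [0.7, 0.2, 0.1]
policy_b = [0.1, 0.2, 0.7]

weights = [0.0, 1.0]  # start fully on Policy B
target = [1.0, 0.0]   # end fully on Policy A
trajectory = []
for _ in range(4):
    weights = step_weights(weights, target, rate=0.5)
    trajectory.append(mix([policy_a, policy_b], weights))
```

The `rate` parameter controls the speed of the transition: a value near 1 approaches a crisp switch, while a small value spreads the change over many steps.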

Additional Objects and Advantages 

Still further objects and advantages will become apparent 
from a consideration of the ensuing description and accom- 
panying drawings. 

4. SUMMARY OF THE INVENTION 

The invention provides a method and apparatus for com- 
bining a plurality of overlapping policy-based process con- 
trollers via a mixture of policies mechanism. The invention 
is also useful for smoothly transitioning control from one 
controller to another. The invention is also useful for sepa- 
rating conflict detection and resolution from policy formu- 
lation and execution. 

Many signal-processing applications used to control or 
regulate other systems can be treated as "policy-based 
controllers." In particular, the invention is applicable to 
policy-based process servers as well as electronic
controllers. A "policy-based" controller admits a conceptual
decomposition into "policy" and "executive." The policy
formulated by a policy-based controller is provided to an
executive mechanism that then uses that policy to guide how 
it executes actions, such as regulating control signals, trig- 
gering procedures, or regulating ongoing processes or pro- 
cedures. The concept of "policy" is quite useful because the 
task of regulating a policy-based controller reduces to the 
task of regulating the associated policy and the associated 
action selection executive. 



A "policy" can be used to exert probabilistic control but 
can also be used for deterministic control. It can also be used 
for parallel control of multiple control signals, or for trig- 
gering multiple processes in parallel. Because "policy-based 

controllers" can be effectively reduced to their associated
policy information, this implies that by combining their 
respective policies one can combine the controllers. 

Separating policy from execution facilitates the design
and development of flexible controllers. Decomposing a
complex policy into sub-policies facilitates the design and
development of flexible policies. However, prior art methods
are limited in how they handle sub-policy
information. The present invention combines the several
policy-based "sub-servers" by combining the "sub-policies"
associated with each sub-server into a single policy. The system
combines multiple policy-based sub-servers by combining
the associated distributional information according to a
measure of relative contribution. The system allows (but
does not require) temporal smoothing of the policy mixture
mechanism. The system provides for detection and
resolution of conflicts that will arise as a result of combining
otherwise incompatible sub-policies. The preferred
embodiment combines the sub-servers by combining the respective
sub-policies, but another embodiment combines the
sub-servers by combining the respective value functions
associated with each sub-server.

A useful characteristic of policy-based controllers is the 
separation of policy formulation from policy execution. This 
invention allows another level of modularity by encapsulat- 
ing the procedures required for detecting and resolving 

conflicts that arise as a result of combining otherwise
incompatible sub-policies. 

The invention is suitable for integrating multiple process
servers on websites. Examples of website servers include
content servers, ad servers, and recommendation engines.
Examples of applications for such website servers include
but are not limited to personalization systems, content
servers for displaying targeted content, electronic commerce
product recommendation systems, and ad servers for
displaying targeted advertisements. The method and apparatus are
also suitable for regulating reactive behaviors in social
agents and virtual personality simulations, such as facial
expressions, as well as displays of reactive affect in general,
such as hand gestures and other nonverbal body language.
In another embodiment, the invention may be
implemented to provide a method for combining multiple
electronic controllers. Robotic toys and toy dolls exemplify the
type of hardware platform that can benefit from the
combination of multiple simple controllers, rather than the
alternative of creating a more complex monolithic controller. The
invention can be used to obtain complex controllers by
combining multiple simpler controllers. Another
embodiment of the invention can also be used to simplify the design
and implementation of monolithic controllers by applying
the engineering design discipline strategies of
modularization and encapsulation. This allows the designer to more
easily scale up to greater complexities. This invention
provides methods for doing so which are more flexible than
prior art.

Other applications are apparent to anyone familiar with
the technology and with the benefit of this specification.

5. DESCRIPTION OF DRAWINGS 

FIGS. 1A through 1J are relevant to the background of
this invention. These drawings illustrate terminology,
introduce important concepts, describe prior art, or explain the
limitations of prior art. FIGS. 2 through 14 describe this
invention.

PRIOR ART

FIG. 1A illustrates a simple stochastic policy. Depicted is
a stochastic policy with 5 actions. The action selection
probabilities sum to 1.0. Although for simplicity we refer to
"actions," the policy can control procedures by letting an
action trigger a procedure.

FIG. 1B illustrates the special case of stochastic policy
given by a deterministic policy. Depicted is a deterministic
policy with 5 actions. The action selection probabilities still
sum to 1.0, but 100% of the action selection probability is
amassed upon a single action.

FIG. 1C illustrates a simple fuzzy policy. Depicted is a
fuzzy policy with 5 actions. Instead of "action selection
probability" the policy defines "degree of membership,"
which is a distribution that need not be probabilistic.
Therefore, the summation of degree of membership can
exceed 1.0.

FIG. 1D illustrates the prior art for selecting from among
a plurality of policies using a gating approach. It also shows
how the resulting policy is then passed along to the action
selection executive. This depicts the essential operation of
prior art that utilizes a gated set of policies, the current state
of the art in reinforcement learning of stochastic control
policy. A straightforward extension of this is obtained by
applying the gating mechanism recursively to each policy to
obtain a hierarchical system with multiple levels of
abstraction.

FIG. 1E illustrates the process of letting a "crisp"
deterministic policy "dominate" a stochastic policy, a process that
can be achieved by prior art as well as by this invention. For
some applications if one policy's recommendation puts all
its weight upon a single action then that action will be
preferable. This process can be achieved by prior art as well
as by this invention.

FIG. 1F illustrates the process of using a stochastic policy
to "soften" the crisp recommendation given by a
deterministic policy. This process is easily achieved by this invention
but using prior art this is at best, more difficult to implement,
and at worst, not at all possible. In this illustrative example,
the output policy is a simple average of the two input
policies.

FIG. 1G illustrates the concept of combining a plurality of
fuzzy policies using a simple combination of two fuzzy
policies. In this illustrative example, the output policy is a
simple average of the two input policies.

FIG. 1H illustrates the concept of combining a plurality of
stochastic policies to obtain a fuzzy policy using a simple
combination of two stochastic policies. In this illustrative
example, the output policy is obtained in two steps: (a)
Apply a winner-take-all selection mechanism to each
stochastic policy, resulting in Actions 1 and 4 being selected.
(b) Add Actions 1 and 4 to the output fuzzy policy using their
weighting under the individual stochastic policies to specify
their weight in the fuzzy policy. (The Mixing Function
possesses additional functionality required to resolve
conflicts that is not illustrated by this example.)

FIG. 1I illustrates the concept of combining a plurality of
fuzzy policies to obtain a stochastic policy using a simple
combination of two fuzzy policies. In this illustrative
example, the output policy is a simple average of the two
input policies normalized to convert the distribution into a
probability distribution.

FIG. 1J illustrates the concept of a mixture of policies
evolving over time using a simple combination of two
stochastic policies that evolves over 3 time steps,
demonstrating the temporal aspect of the mixing function. At t=-2,
the mixing function gives more weight to Policy B. At t=-1,
the mixing function has evolved, giving equal weight to
each even though Policy A and Policy B remain
unchanged; the output policy is a simple average. At t=0, the
mixing function gives more weight to Policy A. This
mechanism allows control to be switched smoothly from one
regime to another over time. Although this example
illustrates the concept of smooth temporal regime switching
using two fuzzy policies, the same basic concept applies to
mixtures of other types of policy.

DRAWINGS DESCRIBING THIS INVENTION

FIG. 2 is a conceptual overview of the invention and how
it integrates with an action selection executive to achieve the
desired result of serving as a controller or process server.
Illustration depicts a mixture of policies. This extends the
gated policy approach by modifying the gating mechanism
in two essential ways: 1. Crisp selection is replaced by a
mixing function. 2. The mixing function has state (i.e., is
persistent), resulting in a functional dependence of the
mixing function upon its state over a previous time range.
Compare and contrast this figure with FIG. 1D.

FIG. 3 shows the major components of the invention. The
mixing function takes N policies and generates as output a
policy that depends upon the N input policies and previous
state of the mixing function. The "mixing function" is
referred to as a function because at any particular moment in
time it performs a functional mapping into policy space.
However, the mixing mechanism computed by the mixing
function need not be a static function or purely reactive
control mechanism; it may invoke procedural code or
physical processes.

FIG. 4A is a block diagram of the major hardware
components and interconnections in accordance with one
embodiment of the invention as a process server.

FIG. 4B is a block diagram of the major hardware
components and interconnections in accordance with one
embodiment of the invention as an electronic controller.

FIG. 5 is a block diagram of the major components of this
invention specifically related to the data flow and control
flow aspects of the operation of one embodiment of the
invention.

FIG. 6 shows one embodiment of how to construct the
Policy Database shown in FIG. 5 in terms of tables and their
associated schemas.

FIG. 7 is a flowchart of an operational sequence to
combine a plurality of control policies in accordance with
one embodiment of the invention.

FIG. 8 is a flowchart of an operational sequence to
combine a plurality of control policies in accordance with
one embodiment of the invention.

FIG. 9 is a flowchart of an operational sequence to
combine a plurality of control policies in accordance with
one embodiment of the invention.

FIG. 10 is a flowchart of an operational sequence to
combine a plurality of control policies in accordance with
one embodiment of the invention.

FIG. 11 is a flowchart of an operational sequence to
combine a plurality of control policies in accordance with
one embodiment of the invention.

FIG. 12 is a flowchart of an operational sequence to
combine a plurality of control policies in accordance with
one embodiment of the invention.

FIG. 13 is a flowchart of an operational sequence to
combine a plurality of control policies in accordance with
one embodiment of the invention.

FIG. 14 is a flowchart of an operational sequence to
combine a plurality of control policies in accordance with
one embodiment of the invention.

FIG. 15 is a perspective view of an exemplary
signal-bearing medium in accordance with one embodiment of the
invention.
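
The conversions described for FIG. 1H and FIG. 1I can be sketched as follows. The function names and example policies are hypothetical, and one detail is an assumption that the figure descriptions do not address: when two stochastic policies select the same winning action, this sketch keeps the larger of the two weights.

```python
def stochastic_to_fuzzy(stochastic_policies):
    """Winner-take-all per policy; each winner keeps its original weight."""
    n_actions = len(stochastic_policies[0])
    fuzzy = [0.0] * n_actions
    for policy in stochastic_policies:
        winner = max(range(n_actions), key=lambda a: policy[a])
        # Assumption: a shared winner keeps the larger weight.
        fuzzy[winner] = max(fuzzy[winner], policy[winner])
    return fuzzy


def fuzzy_to_stochastic(fuzzy_policies):
    """Average the degrees of membership, then normalize to sum to 1."""
    n_actions = len(fuzzy_policies[0])
    average = [sum(p[a] for p in fuzzy_policies) / len(fuzzy_policies)
               for a in range(n_actions)]
    total = sum(average)
    return [x / total for x in average]


# Hypothetical stochastic policies whose winners are Actions 1 and 4.
stochastic_1 = [0.6, 0.1, 0.1, 0.1, 0.1]
stochastic_2 = [0.1, 0.1, 0.1, 0.5, 0.2]
fuzzy_out = stochastic_to_fuzzy([stochastic_1, stochastic_2])

# Hypothetical fuzzy policies; memberships need not sum to 1.
fuzzy_1 = [1.0, 0.5, 0.0, 0.0, 0.5]
fuzzy_2 = [0.0, 0.5, 1.0, 0.5, 0.0]
stochastic_out = fuzzy_to_stochastic([fuzzy_1, fuzzy_2])
```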

6. DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENTS 

This section provides a detailed static description of the
preferred embodiments. To better understand the components
and methods of the invention, a general statement of
the relationships, nomenclature, and environment used to
implement the embodiments of the invention follows in
sections A through C. Thereafter, the apparatuses, methods,
and signal-bearing media of the present invention are
described.

A. Introduction 

See FIG. 2 for a conceptual overview of the invention. See 
the section above titled Formal Definition of Prior Art for 
definition of basic terms such as "policy" and "stimulus." 

B. Formal Definition of the Mixture of Policies Framework
Let S represent the space of all possible stimuli, modeled
as a subset of a real-valued Euclidean space. The "mixed"
policy $\pi^m_t$ is composed of $v$ sub-policies
($\pi^{m,1}_t, \pi^{m,2}_t, \ldots, \pi^{m,v}_t$), along with the
"mixing function"

$$g^m : S \times E \to E,$$

where space E is the set of permissible mixture distributions
over policy space:

$$E \subseteq [0,1]^v.$$

We define E as a hypercube without loss of generality;
one could certainly use an arbitrary subset of v-dimensional
Euclidean space, but this extension is trivial because it
provides no apparent advantage in and of itself and
unnecessarily complicates the operation of the mixing function.
The stimulus space employed here includes external stimuli
as well as internally stored representations of previous
stimuli or internally generated state information. Mixture
distributions are not restricted to probability distributions,
although that is certainly allowed. The mixing function $g^m$
is similar to the gating function g defined previously, but
rather than choose a single policy appropriate for the given
stimulus state, the "soft" gating function given by $g^m$ can
apply a mixture of policies. Furthermore, whereas g is
stateless, $g^m$ is indexed by the mixture state space E. We
define the "mixture state" as a point in E, but we expect that
a "mixture state" per se would be modeled by correspondence
to regions within E.

Mixed policy is defined to be adaptable, but for simplicity,
during execution of the policy $\pi^m_t$, upon a given $a \in A$ (action
space), $s \in S$ (stimulus space), we may refer to $\pi^m_t(s, a)$ rather
than the more strictly appropriate $\pi^m_t(s, a, a^*, s')$. For ease
of description it suffices to let $\pi^m_t$ be a linear weighted sum
of the $v$ sub-policies, so that for $a \in A$, $s_t \in S$,

$$\pi^m_t(s_t, a) = \sum_{k=1}^{v} g_{t,k}\, \pi^{m,k}_t(s_t, a),$$

where $g_{t-i} = g^m(s_{t-i}, g_{t-i-1})$, integer $i < t$,
$g = (g_1, g_2, \ldots, g_v)$, $g \in E$, and $\pi^{m,k}_t$ is the policy
function for the $k$-th sub-policy. The
mixing function $g^m$ associates a scalar value with each
sub-policy that defines its participation. Hierarchical mixtures
of policies are available in other embodiments of the
invention. Recursive application of the mixtures of policies
mechanism is available in other embodiments of the invention.
A hierarchical mixture of policies is readily obtained by
recursive application of the main concept, i.e., by decomposing
sub-policies into mixtures of "sub-sub-policies."
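
The recursive construction can be illustrated with a short Python sketch. It is an illustration under stated assumptions only: the `Mixture` class, its field names, and the toy `always` leaf policies are hypothetical and do not appear in this specification. The sketch treats a mixture as itself being a policy function, so that a mixture can serve as a sub-policy of another mixture.

```python
from dataclasses import dataclass
from typing import Callable, List

# A policy maps (stimulus, action id) to a non-negative weight.
Policy = Callable[[object, int], float]


@dataclass
class Mixture:
    """Weighted sum of child policies (hypothetical helper for illustration)."""
    weights: List[float]    # one mixing weight per child, each in [0, 1]
    children: List[Policy]  # leaf policies or nested Mixture objects

    def __call__(self, stimulus: object, action: int) -> float:
        # A Mixture is callable like any other policy, so nesting one
        # Mixture inside another evaluates recursively.
        return sum(w * child(stimulus, action)
                   for w, child in zip(self.weights, self.children))


def always(action_id: int) -> Policy:
    """Toy deterministic leaf policy placing all weight on one action."""
    return lambda stimulus, action: 1.0 if action == action_id else 0.0


# Sub-policy 1 of the outer mixture is itself a mixture of two
# "sub-sub-policies"; sub-policy 2 is an ordinary leaf policy.
inner = Mixture(weights=[0.5, 0.5], children=[always(1), always(2)])
outer = Mixture(weights=[0.5, 0.5], children=[inner, always(3)])

weights_by_action = {a: outer(None, a) for a in (1, 2, 3)}
```

In this toy case the outer mixture gives Action 3 twice the weight of Actions 1 and 2, because the inner mixture splits its half share between them.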

FIG. 2 depicts a conceptual description of the mixture of 
policies method and its relation to the action selection 
mechanism. 

Having provided an easily understood embodiment, we
now provide the preferred embodiment of the general
mixture mechanism:

(a.) Specify the $v$ sub-policies at time t, $\{\pi^{m,i}_t\}$, $i = 1,
2, \ldots, v$, where $\pi^{m,i}_t$ is the policy function for the $i$-th
sub-policy at time t.
(b.) Specify the #A actions $a \in A$, and the stimulus $s_t \in S$.
(c.) Specify the set of permissible mixture distributions
over v-dimensional policy space: $E \subseteq [0,1]^v$.
(d.) Specify the "recursive mixing function" $g^m: S \times E \to E$,
such that for stimulus $s_t \in S$, and mixing value $h \in E$,
$g^m(s_t, h) \in E$.

(e.) Specify the value of the mixing function at the
previous time step t-1 represented by the recursive
function $h_t = g^m(s_{t-1}, h_{t-1})$, such that the recursion is
finite such that $h_0$ is defined to take a value in E.

(f.) Specify the decomposition of the mixing function
value $g \in E$ into its $v$ components $g = (g_1, g_2, \ldots, g_v)$, such
that given stimulus $s_t \in S$ at time t, said decomposition is
given by

$$g^m(s_t, h_t) = (g_1, g_2, \ldots, g_v).$$

(g.) Compute the functional composition f of the v
sub-policies $\{\pi^{m,i}_t\}$, $i = 1, 2, \ldots, v$, and the v-dimensional
mixing function $g^m$, taking its value in E such that
given stimulus $s_t$, action a, and previous mixing value
$h_t$, f computes the policy weighting for action a given
the stimulus:

$$\pi^m_t(s_t, a) = f\big(g^m(s_t, h_t),\ \pi^{m,1}_t(s_t, a), \ldots, \pi^{m,v}_t(s_t, a)\big).$$

A special case of this uses the non-recursive mixing
function $g^m: S \to E$, such that for stimulus $s_t \in S$, $g^m(s_t) \in E$,
thereby f computes the linear weighted sum of the v
sub-policies as weighted by $g^m$:

$$\pi^m_t(s_t, a) = \sum_{i=1}^{v} g^m_i(s_t)\, \pi^{m,i}_t(s_t, a).$$
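
The linear weighted sum of the special case can be sketched in a few lines of Python. This is a minimal sketch, not the implementation of the invention: the names `mix_policies` and `table_policy`, and the representation of a sub-policy as a lookup table of action weights, are assumptions made for illustration.

```python
from typing import Callable, Dict, Sequence

# A sub-policy maps (stimulus, action id) to a non-negative weight.
Policy = Callable[[object, int], float]


def mix_policies(sub_policies: Sequence[Policy],
                 mixing_value: Sequence[float],
                 stimulus: object,
                 actions: Sequence[int]) -> Dict[int, float]:
    """Return the mixed policy weight for every action in `actions`."""
    if len(sub_policies) != len(mixing_value):
        raise ValueError("need exactly one mixing weight per sub-policy")
    if any(not 0.0 <= g <= 1.0 for g in mixing_value):
        raise ValueError("mixing value must lie in E = [0, 1]^v")
    # Weighted sum over the v sub-policies, evaluated per action.
    return {a: sum(g * pi(stimulus, a)
                   for g, pi in zip(mixing_value, sub_policies))
            for a in actions}


def table_policy(table: Dict[int, float]) -> Policy:
    """Hypothetical stimulus-independent policy backed by a weight table."""
    return lambda stimulus, action: table.get(action, 0.0)


pi_1 = table_policy({1: 0.5, 2: 0.5})
pi_2 = table_policy({2: 0.5, 3: 0.5})
mixed = mix_policies([pi_1, pi_2], [0.5, 0.5], stimulus=None, actions=[1, 2, 3])
```

With equal mixing weights, the action shared by both sub-policies (Action 2) receives the largest weight in the mixed policy.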

C. The Policy Mixture Mechanism 

FIG. 3 gives a description of the major sub-components of
the policy mixture mechanism, and illustrates how the mixing
function module fits into the overall mechanism.

Mixed policy differs from gated policy in that policy is
computed by "mixing" sub-policies according to a "mixture
mechanism." This "mixture mechanism" furthermore has
persistence. Suppose reinforcement learning is disabled at
time t. Let $f^g$ denote an action selection function that accepts
a stimulus $s_t$ and gated policy $\pi^g$ and selects an action
$a = f^g(\pi^g(s_t))$. Let $f^m$ denote an action selection function that
accepts a policy $\pi^m_t$ generated by this invention and selects
an action $a = f^m(\pi^m_t(s_t, g^m_t))$. Abusing notation slightly to
make the point clearer, let $f^g_t(s_t) = f^g(\pi^g(s_t))$ and
$f^m_t(s_t) = f^m(\pi^m_t(s_t, g^m_t))$. Note that for the gated policy approach the
resulting policy is static, so that $s_t = s_{t+i}$, $i = 1, 2, \ldots$,
implies $\pi^g(s_t) = \pi^g(s_{t+i})$ and therefore
$f^g_t(s_t) = f^g_{t+i}(s_{t+i})$, $i = 1, 2, \ldots$ Note that because $g^m$
possesses state, $s_t = s_{t+i}$ does not
imply $f^m_t(s_t) = f^m_{t+i}(s_{t+i})$, $i = 1, 2, \ldots$, because in general

$$\pi^m_t(s_t, g^m_t) \neq \pi^m_{t+i}(s_{t+i}, g^m_{t+i}).$$

The mixture state can be regulated to deliver effects not 

possible under the gated approach. Certain mixture states 
can thereby be more persistent than others, such that, e.g., a
policy can be biased to follow a particular mixture of
policies most of the time, except for occasional excursions
to other points in mixture state. Transition within mixture
state is regulated by a smoothness condition upon $g^m$ that
specifies the speed of transition within E dependent upon
location within E. One embodiment of this invention is to let 
e^iCs-g*) * E, bcB, stS, g.tE, such that (g"*, Js,g m >g ( )<P 
(gj for some p:E-*R. This mechanism explicitly models the 
"duration" of a behavioral regime in the preferred embodi- 
ment of the invention. 

Here is one embodiment of the recursive mixing function
$g^m: S \times E \to E$. For stimulus $s_t \in S$ and mixing value $h \in E$,
$g^m(s_t, h) \in E$. Define the non-recursive mixing function $g: S \to E$
such that for stimulus $s_t \in S$, $g(s_t) \in E$. Next, let the value of the
mixing function at the previous time step t-1 be represented
by the recursive function $h_t = g^m(s_{t-1}, h_{t-1})$, such that the
recursion is finite so that $h_0$ is defined to take a value in E.
Let the scalar value $x \in [0,1]$ and the scalar value $y = 1 - x$.
Specifying the function $q: E^t \to E$ such that
$q(h_t, h_{t-1}, h_{t-2}, \ldots, h_1) \in E$, define the recursive update
function such that given stimulus $s_t \in S$ at time t,

$$g^m(s_t, h_t) = x\, g(s_t) + y\, q(h_t, h_{t-1}, \ldots, h_1).$$

Another embodiment takes $g^m(s_t, h_t) = x\, g(s_t) + y\, h_t$.
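
The simpler update, in which the new mixing value is x times the stimulus-driven target plus y = 1 - x times the previous mixing value, can be sketched as follows. This is a hedged illustration: the function names, the toy target function `g`, and the "hot" stimulus label are hypothetical, and the time indexing is simplified so that each step consumes one stimulus.

```python
from typing import Callable, List, Sequence


def recursive_mixing_step(target: Sequence[float],
                          previous: Sequence[float],
                          x: float) -> List[float]:
    """One update: x * g(s_t) + (1 - x) * h_t, computed per component."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    y = 1.0 - x
    return [x * g_i + y * h_i for g_i, h_i in zip(target, previous)]


def run_mixing(g: Callable[[object], Sequence[float]],
               stimuli: Sequence[object],
               h0: Sequence[float],
               x: float) -> List[List[float]]:
    """Apply the update once per stimulus, starting from the base value h0."""
    history = []
    h = list(h0)
    for s in stimuli:
        h = recursive_mixing_step(g(s), h, x)
        history.append(h)
    return history


def g(stimulus: object) -> List[float]:
    """Toy non-recursive mixing function: favor sub-policy 1 when 'hot'."""
    return [1.0, 0.0] if stimulus == "hot" else [0.0, 1.0]


history = run_mixing(g, ["hot", "hot", "hot"], h0=[0.0, 1.0], x=0.5)
```

Under a constant stimulus the mixing value moves halfway toward the target on every step, which is the smooth, persistent transition between control regimes described above; a smaller x gives a slower transition.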
D. Hardware Components and Interconnections 

One embodiment of the invention utilizes a signal pro- 
cessing system 300 for combining the policy information 
generated by a plurality of controllers, which may be 
embodied by various modular components and interconnec- 
tions as described in FIG. 3. 

Referring to FIG. 3, a signal processing system 300 is
illustrated. In the architecture shown, the apparatus 300
includes N signal processing devices 301, 302, 320, and 303,
which function as policy-based controllers. Here N is some
finite integer number. The fact that these are "policy-based
controllers" implies that by combining their respective
policies one can effectively combine the controllers. Each
controller provides policy information via interconnections 307,
308, and 309 to the Mixing Function module 315. State
information, including external stimulus and internal state
memory, is provided by module 306. In accordance with a
timing module 304 and state 306, the policies associated with
the input signal processing devices 301, 302, 320, and 303
are combined and transmitted by interconnection 310 to
result in the Output Policy 311. This process occurs
repeatedly over time.

1. Digital Database Processing Systems 

One embodiment of this signal processing system is a 
digital data processing apparatus 400 for analyzing 
databases, as illustrated in FIG. 4A. Referring to FIG. 4A, a 
plurality of server computers 401, 402, and 403 provide 
policy information that is stored in a policy database 404 
contained within server computer 400. For example, a server 
computer 401 transmits policy information 405 to a server 
computer 400 by depositing the policy information 405 into 
a policy database 404. 

In one embodiment, the server computers 400-403 may
be personal computers manufactured by Gateway
Incorporated, and may use an operating system manufactured
by Microsoft Corporation. Or, the server computers
may be Unix computers running the Linux operating system.
Or, the server computers 400-403 may be hosted by a single
computer containing a plurality of CPU processors. Or,
server computers 401-403 may be independent process
servers represented as separate software applications running
within a single computer utilizing a single CPU, or
utilizing a plurality of CPUs. Server computers 400-403
may incorporate a database system, such as ORACLE, or
may access data on files stored on a data storage medium
such as disk, CDROM, DVD, or tape.

FIG. 4A shows that, through appropriate methods and 
procedures 406 the behaviors of server computers 401-403 
are combined by server computer 400 and transmitted to 
server computer 407. Data access programs and procedures 
406 access data generated by servers 401-403 via Policy 
Database 404. Other server computers, process servers, 
application servers, computer architectures, or database sys- 
tems than those discussed may be employed. For example, 
the functions of server computer 407 may be incorporated 
into server 401, and vice versa. Methods and procedures 406 
integral to server computer 400 may be housed separately 
from other methods and procedures integral to the server 
computer 400 illustrated in this embodiment. For example, 
server computers 401-403 may be housed within a single 
database processing system that includes Policy Database 
404, and methods and procedures 406 may be housed within 
server computer 407. 

Other embodiments may employ yet other architectures. 
For example, the functions of server computer 407 may be 
incorporated into server 401, and vice versa. Different 
embodiments of this invention may utilize different numbers 
of servers. The server computers may have different func- 
tions (such as personalization system, content server, or ad 
server), such that for example, server computer 401 may be 
an ad server, and server computer 402 may be a content 
server. They all may have similar functions (e.g., all being 
ad servers). 

2. Electronic Signal Processing Systems 

Another embodiment of this signal processing system is 
an electronic controller 410 illustrated in FIG. 4B. Referring 
to FIG. 4B, an electronic controller 410 may be analog or 
digital in operation, and contain a plurality of subcontrollers 
411, 412, and 413. Subcontrollers 411, 412, or 413 may each
be an entire chipset, or may each be a single CPU, or may 
all be housed within a single CPU. Sub-controllers 411, 412, 
and 413 may be general purpose CPUs such as the Pentium 
III sold by Intel Corporation, or the 68332 microprocessor 
sold by Motorola. Alternatively, subcontrollers 411, 412, and 
413 may be special-purpose chipsets utilized in robotic toys 
such as those manufactured by IS Robotics Corporation, or 
in electronic circuitry made available to experimenters and 
robot enthusiasts by Diversified Enterprises (of Santa Bar- 
bara Calif.). Each sub-controller provides Control Policy 
information deposited within a central repository referred to 
here as the Controller Policy Interface 414. The Policy 
Integrator 416 accesses policy information from the Con- 
troller Policy Interface 414, combines it and outputs it to a 
Master Electronic Controller 417. Other architectures may 
be employed. For example, the functions of Controller 
Chipset 411 may be combined with the functions of Con- 
troller Chipsets 412 and 413. The functions of the Master 
Electronic Controller 417 and the Electronic Controller 410 
may be combined within a single CPU, or within a single 
chipset. The functions of Electronic Controller 410 and the 
sub-controllers 411-413 may be combined within a single 
CPU or within a single chipset. Policy Integrator 416 may be 
contained in Electronic Controller 407, and Controller 
Policy Interface 414 may be contained in a single CPU along 
with 411-413. Different embodiments of this invention may
utilize different numbers of sub-controllers. 

3. General Signal Processing System 

The embodiments of FIGS. 4A and 4B can be conceptu- 
alized within a common framework as illustrated in FIG. 5. 
The Policy Database 404 and Controller Policy Interface
414 correspond to the Policy Database 501. The Policy
Integrators 406 and 416 correspond to the Mixture of
Policies Server 504. The Mixture of Policies Server 504
outputs its results to the Output Policy repository 506, which
is accessed by the server computer 407 or by the Master
Electronic Controller 417, respectively, depending upon
which of the two embodiments depicted in FIGS. 4A and 4B
is employed. Other hybrid architectures are possible by
employing a combination of components drawn from the
embodiments depicted in FIGS. 4A and 4B.

4. One Embodiment of the Policy Database 
If the Policy Database 501 is represented using a database
file system, then in this embodiment the component tables
that comprise the Policy Database may be constructed as
depicted in FIG. 6. Table 601 defines the schema of the
policy specification table. Table 602 defines the schema of
the policy table. Table 603 defines the schema of the action
table. Tables 601, 602, and 603 could be represented by
alternative arrangements. They could be stored in flat files,
or represented in logic stored within electronic circuitry or
in another information storage device.
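
As one possible concrete arrangement, the three tables could be declared in a relational database. The sketch below uses SQLite purely for illustration and rests on several assumptions: the snake_case table and column names are our own, only the PolicyId, PolicyType, ActionId, and ActionWeight columns are taken from FIG. 6, any further columns of the policy table are omitted, and the `action_name` column of the action table is hypothetical.

```python
import sqlite3


def create_policy_database(conn: sqlite3.Connection) -> None:
    """Create illustrative versions of Tables 601, 602, and 603."""
    conn.executescript("""
        -- Table 601: one row per policy, tagged fuzzy or stochastic.
        CREATE TABLE policy_spec (
            policy_id   INTEGER PRIMARY KEY,
            policy_type TEXT NOT NULL
                CHECK (policy_type IN ('fuzzy', 'stochastic'))
        );
        -- Table 602: the weight each policy assigns to each action.
        CREATE TABLE policy (
            policy_id     INTEGER NOT NULL REFERENCES policy_spec(policy_id),
            action_id     INTEGER NOT NULL,
            action_weight REAL NOT NULL,
            PRIMARY KEY (policy_id, action_id)
        );
        -- Table 603: the action table (column choice is hypothetical).
        CREATE TABLE action (
            action_id   INTEGER PRIMARY KEY,
            action_name TEXT
        );
    """)


conn = sqlite3.connect(":memory:")
create_policy_database(conn)
conn.execute("INSERT INTO policy_spec VALUES (1, 'stochastic')")
conn.executemany("INSERT INTO policy VALUES (?, ?, ?)",
                 [(1, 1, 0.25), (1, 4, 0.75)])
rows = conn.execute(
    "SELECT action_id, action_weight FROM policy "
    "WHERE policy_id = 1 ORDER BY action_id").fetchall()
```

The CHECK constraint restricts the policy type to the two valid values, so an unsupported type is rejected when it is deposited rather than when the policies are later combined.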

5. Other Embodiments will be Apparent to Skilled Artisans 
Despite the specific foregoing description, ordinarily
skilled artisans having the benefit of this disclosure will
recognize that the apparatus discussed above may be
implemented in a machine of different construction, without
departing from the scope of the invention. As a specific
example, one of the components 413 may be eliminated. Or,
the server computer 403 may be integral to server computer
407, or it may include server computer 401, or may handle
the functions of Policy Integrator 406, or include all of
server computer 400. Regardless of the configuration of the
resulting machine, the signal processing system comprised
by the machine contains several distinct control mechanisms
that are consolidated into a single control mechanism in a
particular manner corresponding to a "mixture of policies."
The manner in which this "mixture of policies" mechanism
is achieved is described in the next section.


7. OPERATION OF INVENTION 

In addition to the various hardware embodiments
described above, a different aspect of the invention concerns
an operational method for combining a plurality of control
policies ("sub-policies") to create an output result that, in a
particular sense, comprises a mixture of the sub-policies. By
this method it is possible to combine a plurality of electronic
controllers, or to combine a plurality of digital computer
process servers.

A descriptive overview of a single iteration for a general
embodiment of the invention is shown in FIGS. 7-14. A
high-level specification of this iterative process is given by
FIG. 7. Subsequent FIGS. 8-14 refine or elaborate upon
modules depicted in FIG. 7.

For ease of explanation, but without any limitation
intended thereby, the examples of FIGS. 7-14 are described
in the context of the process server computer system 400
described above and illustrated in FIG. 4A.

A. Embodiments of the General Method

This procedure may be implemented, for example, by
operating a server computer 400 shown in FIG. 4A to
execute a sequence of machine-readable instructions. These
instructions may reside in various types of data storage
medium. The data storage medium may comprise random
access memory (RAM) contained within server computer
400. Alternatively, the instructions may be contained in
another data storage medium such as a magnetic data storage
diskette 1500 as shown in FIG. 15. Whether contained in
server computer 400 or elsewhere, the instructions may
instead be stored on an alternative data storage medium such as
a direct access storage device (hard drive, RAID array,
CD-ROM, DVD disk, WORM), solid state electronic
memory such as RAM, or sequential access memory such as
magnetic tape, paper punch cards, or punch-hole tape.

These instructions may be encoded using various types of 
programming languages such as C, C++, Fortran, Java, 
Javascript, BASIC, Pascal, Perl, TCL, or other similar
programming or scripting language. These instructions may 
be in the form of machine-readable instructions such as 
compiled Java bytecode, or in uncompiled Javascript. In this 
respect, one aspect of the present invention concerns an 
article of manufacture, comprising a data storage medium 
tangibly embodying a program of machine-readable instruc- 
tions executable by a digital processing system to perform 
the operational steps to combine a plurality of server pro- 
cesses. 

In another embodiment, this operational procedure may
be implemented by operating an electronic controller 410 in
FIG. 4B to execute machine logic. This machine logic may
reside in various types of data storage medium. The data storage
medium may comprise random access memory (RAM,
ROM, or EPROM) contained within electronic controller
410 or accessible to electronic controller 410 by a data
interconnection. Whether available within electronic con- 
troller 410 or via interconnection to external storage 
medium, the instructions may be contained in other data 
storage media, such as a magnetic data storage diskette 1500 
as shown in FIG. 15, or in a direct access storage device 
(hard drive, RAID array, CD-ROM, DVD disk, WORM), 
solid state electronic memory such as RAM, or sequential 
access memory such as magnetic tape, paper punch cards, or 
punch-hole tape. 

These instructions may be encoded using various types of 
programming languages such as C, C++, Fortran, Java, 
Javascript, BASIC, Pascal, Perl, TCL, or other similar 
programming or scripting language. These instructions may 
be in the form of machine-readable instructions such as 
compiled Java bytecode, or in interpreted Javascript. In this
respect, one aspect of the present invention concerns an 
article of manufacture embodying a system of machine-logic 
executable by a signal processing system to perform the 
operational steps to combine a plurality of electronic con- 
trollers. 

B. The General Method 

As mentioned above, FIGS. 7-14 show a sequence of
method steps illustrating the method aspects of the invention.
Readers familiar with the particular methodology associated
with stochastic control will readily understand the
following detailed descriptions. Readers familiar with the
general methodology associated with an information science
(e.g., computer science, computer programming, computer
architecture, operating systems science, control systems
science, electrical engineering, economics, econometrics,
mathematical programming, electronic engineering) will
readily be able to understand the following detailed descriptions.
Readers familiar with the general methodology associated
with an engineering discipline related to signal processing
systems (e.g., computer science, computer
programming, computer architecture, electrical engineering,
electronic engineering) will be able to implement the
following detailed descriptions in a physical realization such as
one of the embodiments described above.
For ease of explanation, but without any limitation
intended thereby, the examples of FIGS. 7-14 are described
in the context of the process server computer system 400
described above and illustrated in FIG. 4A.
1. Process Specification for Main Embodiment

Referring to FIG. 7, the general method of the invention 
begins in step 701. In this example the control flows 
sequentially through four main modules 703-707. Input 
from stimulus 702 and Input Policy Database 710 is pro- 
cessed and the result is deposited in Output Policy Database 
709. 

a) Initialize Module

Control flow begins in step 701 and proceeds to step 703. 
Step 703 initializes key parameters. Step 703, 
"INITIALIZE," is described in further detail in FIG. 8. 
Referring to FIG. 8, certain parameters are specified within 
the INITIALIZE module itself, whereas the values of other 
parameters are determined by querying input processes. Step 
801 queries the Input Policy Database 710 to identify the 
number v of sub-policies represented within Input Policy 
Database 710. The types of policies contained in Input
Policy Database 710 are also identified and a memory
variable is set to one of three cases:
1. Input Policy Database contains only stochastic policies.
2. Input Policy Database contains only fuzzy policies.
3. Input Policy Database contains at least one stochastic
policy as well as at least one fuzzy policy.
(A Deterministic Policy is a special case of Stochastic
Policy and is treated as a Stochastic Policy.)
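
The bookkeeping performed by step 801 can be sketched as a small Python function. This is an assumption-laden illustration rather than the patented procedure: the function name, the dictionary representation of the policy specification table, and the three string labels returned for the memory variable are all hypothetical.

```python
from typing import Dict, Tuple


def classify_input_policies(policy_types: Dict[int, str]) -> Tuple[int, str]:
    """Return (v, case) for a mapping of policy id to policy type."""
    if not policy_types:
        raise ValueError("Input Policy Database contains no sub-policies")
    # A deterministic policy is treated as a stochastic policy.
    kinds = {"stochastic" if t == "deterministic" else t
             for t in policy_types.values()}
    unknown = kinds - {"stochastic", "fuzzy"}
    if unknown:
        raise ValueError(f"unsupported policy types: {sorted(unknown)}")
    if kinds == {"stochastic"}:
        case = "stochastic_only"
    elif kinds == {"fuzzy"}:
        case = "fuzzy_only"
    else:
        case = "mixed"
    return len(policy_types), case


v, case = classify_input_policies({1: "stochastic", 2: "deterministic", 3: "fuzzy"})
```

Here `v` is the number of sub-policies and `case` records which of the three cases applies, so later modules can branch on it without querying the database again.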
Step 802 queries the Output Policy Database 709 to
determine the type of Output Policy. A memory variable is
set to one of two cases: 1. Stochastic, or 2. Fuzzy, depending
upon the type of Output Policy. Step 803 initializes the
mixing function specification $g^m$. One embodiment of this
invention allows $g^m$ to be retrieved from external memory
storage; however, for this example of the general method the
mixing function specification is retrieved from
Program Memory 711 internal to the process server
computer 400. Henceforth, we will assume that Program
Memory 711 is read-write accessible to all steps in FIGS.
7-14.

b) Specification of the Mixing Function and Mixing Space
The range of valid specifications for mixing function $g^m$
is precisely defined above in the section titled "Formal
Definition of the Mixture of Policies Framework." We
provide a specific embodiment here. Recall that one
embodiment of $\pi^m_t$ is as a linear weighted sum of the v sub-policies,
so that for $a \in A$, $s_t \in S$,

$$\pi^m_t(s_t, a) = \sum_{k=1}^{v} g_{t,k}\, \pi^{m,k}_t(s_t, a),$$

where $g_{t-i} = g^m(s_{t-i}, g_{t-i-1})$, integer $i < t$,
$g = (g_1, g_2, \ldots, g_v)$, $g \in E$, and $\pi^{m,k}_t$ is the policy
function for the $k$-th sub-policy.
Let $h: S \to E$. This means that h is a function that takes as one
of its input parameters a stimulus $s \in S$ and outputs a
v-dimensional vector $(h_1, h_2, \ldots, h_v)$. Let $h_{t-1} = h(s_{t-1})$, and
$h_t = h(s_t)$. Now let $g^m(s_t, g_{t-1}) = 0.1\, h_t + 0.9\, g_{t-1}$. Intuitively,
h responds to a stimulus and provides a "target" location in
mixing space towards which the mixing function $g^m$ moves.
If the stimulus remains unchanged over two successive time
steps t and t-1, then $h_{t-1} = h_t$ and the mixing function will
steadily move closer to $h_t$. This is achieved by adding 10%
of $h_t$ to 90% of $g_{t-1}$ via vector addition. In other
embodiments, the numbers 0.1 and 0.9 could be replaced by
any other two real numbers a and b such that a+b=1.0.

There exist other more general methods for smoothly
updating the mixing vector in this fashion to incorporate a
dependence upon its values over previous time steps. One
such class of methods is commonly referred to in neural
network literature as "momentum update" methods and
includes the particular method described in the previous
paragraph immediately above. Another more general set of
methods is commonly referred to in computational finance
literature and electrical engineering literature as "moving
average" methods. These and other useful methods readily
apparent to the skilled artisan fall within the range of valid
specifications for the mixing function.

Step 804 initializes variable E: the space of permissible
mixing distributions. The range of valid specifications for E
is precisely defined above in the section titled "Formal
Definition of the Mixture of Policies Framework."

c) Initialize Time-dependent Memory Variables
Step 806 initializes variables to track the time t, the time
step increment, and the memory variables $g_{t-1}$ and $s_{t-1}$ that
remember the mixing function and stimulus for the previous
time step, respectively.

d) Apply Mixing Function Module
Step 806 passes control to step 704 in FIG. 7, the module
titled "APPLY MIXING FUNCTION." This module is
described in further detail in FIG. 9. Referring to FIG. 9, the
APPLY MIXING FUNCTION module begins in step 901
and continues to step 903, which retrieves stimuli $s_t$ from
step 902. Step 904 computes the mixture of policies
weighting for each action a in A. The mixing function g is
computed according to the specification $g^m$ determined in
step 803, taking the current stimulus $s_t$ and previous mixing
value $g_{t-1}$ as its input parameters. Step 904 ensures that the
resulting mixing value $g^m(s_t, g_{t-1})$ is in E and if not projects
it to be the closest permissible value in E. Utilizing E to
constrain the range of valid mixing values yields practical
benefits, and methods for implementing this constraint will
be apparent to the skilled artisan. In this embodiment we
simply let $E = [0,1]^v$, thereby allowing all possible mixing
values to be permissible.
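
For the hypercube case, the projection performed by step 904 reduces to clipping each component into the unit interval, since the closest point of a box (in the Euclidean sense) is obtained coordinate by coordinate. The sketch below shows that case only; the function name is hypothetical, and a different choice of E would need its own projection.

```python
from typing import List, Sequence


def project_to_unit_hypercube(g: Sequence[float]) -> List[float]:
    """Return the closest point of [0, 1]^v to the mixing value g."""
    # Clip every component independently: values below 0 become 0,
    # values above 1 become 1, and values already inside are unchanged.
    return [min(1.0, max(0.0, component)) for component in g]


projected = project_to_unit_hypercube([1.5, -0.2, 0.3])
```

A mixing value that already lies in E is returned unchanged, so the projection has no effect unless the update step overshoots.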

Given current stimulus $s_t$, for each a in A step 904
computes $\pi^m_t(s_t, a)$. The set $\Pi_t = \{\pi^m_t(s_t, a),\ a \in A\}$ comprises
the (preliminary) Output Policy. This is considered the
"preliminary" Output Policy because in practical application
there may be conflicts that need to be resolved (this is
handled in step 705). Step 905 stores this policy into the
Output Policy Database. Step 906 exits the APPLY MIXING
FUNCTION module and passes control to module 705,
DETECT and RESOLVE CONFLICTS.

e) Detect and Resolve Conflicts Module 
Module 705 is described in further detail in FIG. 10. 

Referring to FIG. 10, control enters the module in step 1001 
and proceeds to step 1002. Step 1002 is a comparison 
operation that branches the flow of control depending upon 
the value of the variable set in step 802, the Output Policy 
Type. If the Output Policy Type is "Fuzzy", step 1002
transfers control to step 1003; otherwise control is
transferred to step 1004.

f) Resolve Static Intra-Policy Conflicts Module 
Step 1003 is module RESOLVE STATIC INTRA-POLICY
CONFLICTS, which is described further in FIG.
11. Referring to FIG. 11, control enters the RESOLVE
STATIC INTRA-POLICY CONFLICTS module in step
1101 and proceeds to step 1102. Step 1102 is a conditional
branch that determines whether or not all actions in A can be
performed simultaneously. A fuzzy policy can be used to
trigger a plurality of actions in parallel, and actions triggered
simultaneously can cause conflicts. Therefore, this step is
appropriate because the branching condition in step 1002
"Output Policy Type = Fuzzy" took a value of TRUE, so the
output policy type has been determined to be a fuzzy policy.
Step 1102 preemptively determines whether a set of
actions $\{a_1, a_2, \ldots, a_n\}$ can be triggered simultaneously. The




range of possibilities under which such conflicts can occur
depends greatly upon the specific application. For example,
some airplanes cannot move their left aileron up and the
right aileron up at the same time. However, for ease of
explanation and concreteness we describe a specific mechanism
for implementing this step. Let P(A) be a set of subsets
of A. If a subset $\{a_1, \ldots, a_n\}$ exists in P(A), then those
n actions are permissible. Now create set $\{a_1, \ldots, a_n\}$
by examining $\Pi_t$ and identifying the actions a in A for which
$\pi^m_t(s_t, a)$ is nonzero. For this embodiment, $\pi^m_t(s_t, a) = 0$
implies that a is not triggered under $\Pi_t$. Next, determine
whether $\{a_1, \ldots, a_n\}$ is in P(A). If it is, there is no
conflict. If it is not, there is a conflict.

A skilled artisan may imagine applications for which this
simple mechanism is either inadequate or inappropriate.
Furthermore, a skilled artisan will create more sophisticated
mechanisms for preemptively detecting conflicts among
candidate actions. This may be done with a reactive controller
(i.e., a controller such as a black box function that
simply examines a set of actions and outputs a value 0 or
1 depending upon whether the actions can be triggered
simultaneously or not) or with an algorithmic procedure.
Either way, the result will be to determine whether all
actions $\{a_1, a_2, \ldots, a_n\}$ that can be triggered under policy
$\Pi_t$ can be performed simultaneously or not.

If all actions under the current output policy $\Pi_t$ can be
performed simultaneously then there are no conflicts to
resolve and control passes to step 1103, exiting the
RESOLVE STATIC INTRA-POLICY CONFLICTS module
and passing control to step 1004. Otherwise, control
proceeds to step 1104.

Step 1104 specifies a method for resolving the conflicts
detected in step 1102. In general, the main responsibility of
this step is to resolve conflicts by eliminating the possibility
that two actions can be triggered which would cause a
conflict if performed simultaneously. The four steps specified
in step 1104 provide one embodiment for achieving this
goal. The basic approach is to apply a linear order to all
actions a in A. Let this linear order be denoted by
$q(A) = (q_1, \ldots, q_M)$, where M = #A, the number of actions in A, and
for $i = 1, 2, \ldots, M$, $q_i$ is an integer taking values in the range
[0,M]. This linear order simply specifies which actions are
preferable to others if forced to choose between conflicting
actions. The active actions under policy $\Pi_t$ are sorted
according to q(A) and the least important action that poses
a conflict is deleted from $\Pi_t$ and policy $\Pi_t$ is updated to
reflect this modification. If the new policy contains no
conflicts then control passes to step 1105. Otherwise, this

proceeds to step 1004, RESOLVE CONFLICTS WITH
ONGOING ACTIONS. Or, control can also be passed to
step 1004 from step 1003.

Step 1004 is described further in FIG. 12. Referring to
FIG. 12, control is initiated in step 1201 and proceeds to step
1202. Step 1202 performs methods and procedures for
detecting conflicts that would arise if actions under the
current output policy were to be performed simultaneously
with ongoing procedures.

The RESOLVE CONFLICTS WITH ONGOING
ACTIONS module of FIG. 12 is similar to the RESOLVE
STATIC INTRA-POLICY CONFLICTS module of FIG. 11.
However, rather than handling conflicts between actions
within the current output policy $\Pi_t$, it handles conflicts that
will arise between actions under the current output policy $\Pi_t$
and other ongoing actions. These other ongoing actions
could be initiated by server computer 400 during previous
time steps under a previous policy (e.g., output policy $\Pi_{t-T}$
for some time t-T). Or these other ongoing actions could be
outside of the control of server computer 400. Referring to
FIG. 12, control enters the RESOLVE CONFLICTS WITH
ONGOING ACTIONS module in step 1201 and proceeds to
step 1202. Step 1202 is a conditional branch that determines
whether or not all actions in A can be performed
simultaneously with ongoing actions.

Let $\Pi_t$ be the current output policy, and let the set of
actions $\{a_1, a_2, \ldots, a_n\}$ be those actions that can be triggered
by $\Pi_t$. For ease of explanation we describe a specific
mechanism for implementing step 1202. Let B represent the
set of ongoing actions $\{b_1, b_2, \ldots, b_R\}$, where R is the
number of ongoing actions. Let C = A×B give the Cartesian
product of sets A and B. Let P(C) be a set of subsets of C.
If a subset $\{a_1, a_2, \ldots, a_n, b_1, b_2, \ldots, b_R\}$ exists in P(C),
then those n+R actions are permissible. Now create set
$\{a_1, a_2, \ldots, a_n\}$ by examining $\Pi_t$ and identifying the actions a
in A for which $\pi^m_t(s_t, a)$ is nonzero. For this embodiment,
$\pi^m_t(s_t, a) = 0$ implies that a is not triggered under $\Pi_t$. (Note
that this is true for stochastic policy as well.) Next:
1. If the output policy type is "Fuzzy" determine whether
$\{a_1, a_2, \ldots, a_n, b_1, b_2, \ldots, b_R\}$ is in P(C). If it is, there is
no conflict. If it is not, there is a conflict.
2. If the output policy type is "Stochastic" determine
whether $\{a_i, b_1, b_2, \ldots, b_R\}$ is in P(C) for each
$i = 1, 2, \ldots, n$. If this is so for each action under the output
policy there is no conflict. Otherwise there is a conflict.

Note that we only need to determine whether a single
action conflicts with ongoing actions for the stochastic
policy, because a stochastic policy only triggers a single

procedure is repeated until no conflicts remain under n r action at one time step. On the other hand, a fuzzy policy can 

This conflict resolution strategy could be modified in a 50 simultaneously trigger a plurality of actions, so we need to 

large number of ways depending upon the practical appli- check all combinations of actions under the output policy 

cation and theoretical constraints. The skilled artisan would against ongoing actions. 

typically customize the procedure described here when A skilled artisan may imagine applications for which this 

applying this invention. For example, the procedure sped- simple mechanism is either inadequate or inappropriate and 

fied for this particular embodiment admits numerous van- 55 be able to create other mechanisms for preemptively detect- 

ants that are apparent to the skilled artisan. For example, the ing conflicts between candidate actions and ongoing actions, 

linear order q(A) that assigns a measure of "importance" to Regardless, the result will be to determine whether all 

each action could have additional dependencies. For actions {a Jf a 2 a„} that can be triggered under policy 

example, it could depend upon time, it could depend upon IT, can be performed simultaneously with ongoing actions. If 

the current stimulus, or it could depend upon the current 60 so, then control passes to step 1203, exiting the module and 

output policy IT^ passing control to step 1005 because no conflict resolution 

Upon performing the methods and procedures in step is necessary. Otherwise, control proceeds to step 1204. 

1104 control is passed to step 1105, exiting this module and Step 1204 specifies a method for resolving tbe conflicts 

passing control to step 1004. detected in step 1202. In general, the main responsibility of 

g) Resolve Conflicts with Ongoing Actions Module 65 this step is to resolve conflicts by eliminating the possibility 

Referring back to FIG, 10, if the branching condition in that actions that can be triggered by the current output policy 

1002 "Output Policy Type»Fuzzy" is not TRUE then control would cause a conflict if performed simultaneously with 
ongoing actions. The six steps specified in step 1204 provide one
embodiment for achieving this goal. The basic approach is similar to that
specified for step 1104. We apply a linear order to all actions a in A.
Let this linear order be denoted by q(A)=(q_1, q_2, ..., q_M), where
M=#A, the number of actions in A, and for i=1,2,...,M, q_i is an integer
taking values in the range [0,M]. This linear order simply specifies
which actions are preferable to others if forced to choose between
conflicting actions. The active actions under policy Π_t are sorted
according to q(A) and the least important action that poses a conflict is
deleted from Π_t and policy Π_t is updated to reflect this modification.
If the new policy contains no conflicts then control passes to step 1205.
Otherwise, this procedure is repeated until no conflicts remain under Π_t
given the current ongoing actions.

This conflict resolution strategy could be modified in a number of ways
depending upon the practical application and theoretical constraints. The
skilled artisan would typically customize the procedure described here
when applying this invention. For example, the procedure for this
particular embodiment admits numerous variants that are apparent to the
skilled artisan. For example, the linear order q(A) that assigns a
measure of "importance" to each action could have additional
dependencies. For example, it could depend upon time, it could depend
upon the current stimulus, or it could depend upon the current output
policy Π_t. Furthermore, if server computer 400 has the capability to
abort ongoing actions, then another strategy is to selectively abort
ongoing actions until ongoing actions present no imminent conflict with
the current output policy. Additionally, hybrid schemes are possible
which selectively abort some ongoing actions as well as delete actions
from the current output policy.

Upon performing the methods and procedures in step 1204 control is passed
to step 1205, exiting this module and passing control to step 1005.

h) Resolve Sequencing Conflicts With Immediately Prior Actions Module

Step 1005, RESOLVE SEQUENCING CONFLICTS WITH IMMEDIATELY PRIOR ACTIONS,
is described in further detail in FIG. 13. Referring to FIG. 13, the
module starts in step 1301 and proceeds to step 1302.

In this embodiment, actions that have the potential to create conflicts
with future actions are labeled as "ongoing" actions until they have
completed. This information is stored in Program Memory 711. This way
conflicts created by these actions can be detected by the procedure
titled "RESOLVE CONFLICTS WITH ONGOING ACTIONS". However, it is possible
for some conflicts caused by actions triggered in the immediately prior
step to be missed due to timing effects. This sequencing permissibility
check catches those sequencing conflicts that are missed due to timing
effects.

The embodiment for this module is exactly analogous to that for step
1004, the RESOLVE CONFLICTS WITH ONGOING ACTIONS module. The steps
required are:

1. Consult Program Memory 711 to determine which action(s) were triggered
   at the previous time step (by the executive in accordance to the
   output policy recommendation for that time step). If these actions
   were not in the set of ongoing actions handled in step 1004, treat
   them as if they are actually ongoing actions.

2. Resolve conflicts using method directly analogous to step 1004, the
   RESOLVE CONFLICTS WITH ONGOING ACTIONS module.

This treats actions triggered during the immediate time step as "ongoing
actions" regardless of whether or not they are detectable by any other
means as ongoing actions, for it is possible that they are still in
process of being initiated.

If step 1302 detects no potential for conflict between the current policy
and ongoing actions then step 1303 passes control to step 1304, exiting
this module. Otherwise, control passes to step 1305. Step 1305 contains a
method for resolving the conflicts detected in steps 1302 and 1303. Step
1305 is exactly analogous to step 1204. Note that the 6 steps described
in step 1305 in FIG. 13 reuse the same 6 steps described for step 1204 in
FIG. 12.

When step 1305 is completed, control passes to the next step, which exits
the RESOLVE SEQUENCING CONFLICTS WITH IMMEDIATELY PRIOR ACTIONS module
and transfers control to step 1006, exiting the DETECT and RESOLVE
CONFLICTS module, and transferring control to the OUTPUT POLICY RESULT
module.

i) Output Policy Result Module

This module is described further in FIG. 14. Referring to FIG. 14,
control enters step 1401 and proceeds to step 1403, which stores the
updated version of the current output policy Π_t to the Output Policy
Database 709. Step 1404 records the value of current memory variables
that are required for the next time step within the Program Memory 711.
Control proceeds to step 1405, exiting this procedure and transferring
control to step 707 in FIG. 7.

j) Continue?

Step 707 contains a stopping mechanism to determine whether to continue
or exit. If continuing then control proceeds to step 704, otherwise, it
proceeds to step 708. The general method ends in step 708.

2. Other Embodiments will be Apparent to Skilled Artisans

While what have been shown are considered to be the preferred embodiments
of the invention, it will be apparent to those skilled in the art that
various changes and modifications can be made herein without departing
from the scope of the invention as defined by the appended claims.

In particular, specific steps which admit a variety of alternative
embodiments too numerous to specify here but which are apparent to the
skilled artisan include the following steps:

1. Steps 1102 and 1104
2. Steps 1202 and 1204
3. Steps 1302 and 1305
4. Step 1403.

These steps are related in that they are involved in conflict detection
or conflict resolution. In practical applications there may be an
uncountable number of particular methods for detecting conflicts or
resolving conflicts that may arise. Different embodiments of this
invention may utilize different conflict management schemes. Regardless
of the particular conflict management scheme used in the resulting
machine, the signal processing system comprised by the machine may
contain: (a) A module encapsulating the conflict management duties, and
separating them from policy formulation and policy execution, (b) A
method for leveraging the availability of a plurality of overlapping
policies in order to help detect and resolve policy conflicts before
handing a control policy off to an executive for execution.

Also, the skilled artisan will understand that the general method allows
the applicability of the theory of functionals, also known as
"distributions" or "generalized functions" as described in Chapter 9 of
[Folland 1984]. This theory provides several general methods for
combining distributions. The skilled artisan will understand how such
methods can be implemented under the general method of this invention.
Distributions in this general sense resemble functions but are more
singular. See for instance the discussion in [Folland 1984] of "tempered
distributions" and methods for composing a distribution from a plurality
of tempered distributions.
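
As a rough illustration of the conflict handling described for steps 1202 and 1204, the following is a minimal sketch and not the implementation of the specification. It assumes a policy is held as a dictionary from action identifiers to weights, that the permissible combinations are held as a set of frozensets, and that "the least important action that poses a conflict" may be read as follows: for a stochastic policy, an action that is not permissible together with the ongoing actions; for a fuzzy policy, preferably an action whose removal makes the remaining set permissible. The names has_conflict, resolve_conflicts, permissible and importance are hypothetical and do not appear in the specification.

```python
from typing import Dict, FrozenSet, Set

Policy = Dict[str, float]  # action id -> weight (probability or fuzzy membership)


def active_actions(policy: Policy) -> Set[str]:
    """Actions with nonzero weight, i.e. those the policy can trigger."""
    return {a for a, w in policy.items() if w != 0}


def has_conflict(policy: Policy, ongoing: Set[str],
                 permissible: Set[FrozenSet[str]], policy_type: str) -> bool:
    """Step 1202 (sketch): can the triggerable actions run with ongoing ones?"""
    active = active_actions(policy)
    if policy_type == "fuzzy":
        # A fuzzy policy may trigger all active actions at the same time.
        return frozenset(active | ongoing) not in permissible
    # A stochastic policy triggers a single action per time step.
    return any(frozenset({a} | ongoing) not in permissible for a in active)


def resolve_conflicts(policy: Policy, ongoing: Set[str],
                      permissible: Set[FrozenSet[str]],
                      importance: Dict[str, int], policy_type: str) -> Policy:
    """Step 1204 (sketch): delete least important conflicting actions."""
    policy = dict(policy)  # work on a copy; the caller's policy is unchanged
    while has_conflict(policy, ongoing, permissible, policy_type):
        active = active_actions(policy)
        if not active:
            break  # the ongoing actions conflict on their own; nothing to delete
        if policy_type == "stochastic":
            candidates = [a for a in active
                          if frozenset({a} | ongoing) not in permissible]
        else:
            # Assumption: prefer an action whose removal clears the conflict.
            fixes = [a for a in active
                     if frozenset((active - {a}) | ongoing) in permissible]
            candidates = fixes or list(active)
        victim = min(sorted(candidates), key=lambda a: importance[a])
        policy[victim] = 0.0  # zero weight means the action is not triggered
    return policy
```

The sketch leaves the weights of the remaining actions untouched. Whether a stochastic policy should be renormalized after a deletion is a design choice that the text above does not settle.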
8. CONCLUSION, RAMIFICATIONS, AND
SCOPE OF INVENTION 

Thus the reader will see that the invention provides a 
highly flexible method that can be used by skilled artisans to 
combine a plurality of policy-based controllers, or to com- 
bine a plurality of policy-based process servers. The inven-
tion can also be used by skilled artisans to provide policy-based
controllers and policy-based process servers with better 
regime-switching capability — i.e., the ability to detect that 
(a.) providing a first input information transmitting device

representing an input control stimulus, 
(b.) providing a second input information transmitting 
device representing a plurality of input control policies, 
(c.) providing an output information transmitting device 

representing an output control policy, 
(d.) combining said input control policies into said output 
control policy, such that more than one said input 
control policy may simultaneously influence said out- 
put control policy for said input control stimulus, 
(e.) transmitting said output control policy via said output 
information transmitting device, whereby said method 
will combine said input control policies by functional 
composition to obtain a single output control policy, 
whereby more than one said input control policies are 
able to simultaneously influence the said output 
control policy for the said input control stimulus, 
whereby said method can iterate over time, 
whereby said method will allow said output control 
policy to smoothly transition control from being 
influenced substantially by one of said input control 
policies to being influenced by substantially another 
of said input control policies, 
whereby said method will allow said output control 
policy to smoothly transition control from being 
influenced substantially by one particular functional 
composition of said input control policies to being 
influenced substantially by another functional com-
position of said input control policies,
whereby said method will allow the combination of a 
plurality of input control policies for the purpose of 
consolidating that information into a form suitable 
for use by a policy-based "action selection execu-
tive" (i.e., a policy-based "controller").

2. The method recited in claim 1, further including 
(a.) providing an information storage device which is able 

to store the plurality of input control policies recited in 
claim 1,
(b.) means for storing the plurality of input control 
policies recited in claim 1 into said information storage 
device, 

(c.) retrieving the plurality of input control policies 
recited in claim 1 from said information storage device
and making this information available to the method 
recited in claim 1 via the second input information 
transmitting device recited in claim 1, 
whereby said method may be encapsulated thereby 
allowing physical separation of input control policies 
and the information generating devices that gener- 
ated the input control policies. 

3. The method recited in claim 1, further including 
(a.) providing an information storage device which is able 

to store the input control stimulus recited in claim 1, 
(b.) storing the input control stimulus recited in claim 1 

into said information storage device, 
(c.) retrieving said input control stimulus from said infor-
mation storage device and making this information
available to the method recited in claim 1 via the first 
input information transmitting device recited in claim 
1, 

whereby said method may be encapsulated thereby 
allowing physical separation of input control stimuli
and the information generating devices that gener- 
ated the input control stimuli. 
4. The method recited in claim 1, further including
(a.) providing an information storage device which is able 

to store the output control policy recited in claim 1, 
(b.) storing said output control policy into said informa- 
tion storage device, 
(c.) retrieving said output control policy from said infor- 
mation storage device and making this information 
available externally via said output information trans- 
mitting device representing a persistent copy of the 
output control policy recited in claim 1, 
whereby said method may be encapsulated thereby 
allowing physical separation of the method and the 
information processing devices that utilize the output 
control policy. 

5. The method recited in claim 1, further including 
(a.) providing an action distribution input information 

transmitting device capable of identifying a plurality of 
actions, 

(b.) providing an information processing device capable 
of applying a control policy to said plurality of actions 
and selecting a single output action from said plurality 
of actions, 

(c.) providing an action output information transmitting 
device capable of describing or identifying an action, 

(d.) utilizing said information processing device to the 
task of selecting an output action from said plurality of 
actions via said action distribution input information 
transmitting device according to the output control 
policy recited in claim 1, 

(e.) transmitting said action via said output information 
transmitting device, 

whereby said method may be utilized to select said 
output action from said plurality of actions described 
or identified by said input information transmitting 
device, 

whereby the said selection of output action from the said
plurality of actions depends upon the output control 
policy, 

whereby said selection can occur repeatedly over time, 
whereby said method may be utilized as a controller by 

using it to select said output action in a fashion 

dependent upon said input control policies and make

this selection externally available, 
whereby said method may be utilized as a stochastic 

controller. 

6. The method recited in claim 5, further including 

(a.) providing an action distribution input information 

storage device capable of containing a description of a 

plurality of actions, 
(b.) storing a description of said plurality of actions into 

said action distribution input information storage 

device, 

(c.) retrieving said descriptions of said plurality of actions
from said action distribution input information storage 
device and making this information available to the 
method recited in claim 5, 

whereby said method may be encapsulated thereby 
allowing physical separation of the method and the 
information processing devices that utilize said 
method as a controller, 

whereby said method may be encapsulated thereby 
allowing physical separation of the method and the 
information processing devices that generate input 
control policy information that is provided as input 
to the method, 

whereby information representing or identifying 
actions referred to by the method or controlled by the 
method can be maintained and utilized internally
within the method, 
whereby said method may be utilized as a stochastic 
controller and cleanly encapsulated as a distinct 
information processing system. 

7. The method recited in claim 1, further including 

(a.) providing an action distribution input information 
transmitting device capable of identifying a plurality of 
input actions, 

(b.) providing an information processing device capable 
of applying the output control policy recited in claim 1 
to said plurality of input actions and selecting a plu- 
rality of output actions from said plurality of input 
actions, 

(c.) providing a fuzzy action distribution output informa- 
tion transmitting device capable of describing a plural- 
ity of output actions, 

(d.) utilizing said information processing device to the 
task of selecting a plurality of output actions from said 
plurality of input actions via said action distribution 
output information transmitting device according to the 
output control policy recited in claim 1, 

(e.) transmitting said plurality of output actions via said 
fuzzy action distribution output information transmit-
ting device,

whereby said method may be utilized to select a 
plurality of output actions from a plurality of input 
actions described or identified by said action distri- 
bution input information transmitting device, 

whereby the said selection of plurality of output actions 
from the said plurality of input actions depends upon 
the output control policy recited in claim 1, 

whereby said selection can occur repeatedly over time, 

whereby said method may be utilized as a controller by 
using it to select a plurality of output actions in a 
fashion dependent upon said input control policies
and make this selection externally available, 

whereby said method can be utilized as a fuzzy con- 
troller. 

8. The method recited in claim 7, further including 

(a.) providing an input information storage device capable 
of containing a description of a plurality of stored input 
actions, 

(b.) storing a description of said plurality of stored input 
actions into said information storage device, 

(c.) retrieving said descriptions of said plurality of stored 
input actions from said information storage device and 
making this information available to the method recited 
in claim 7, 

whereby said method may be encapsulated thereby 
allowing physical separation of the method and the 
information processing devices that utilize said 
method as a controller, 

whereby said method may be encapsulated thereby 
allowing physical separation of the method and the 
information processing devices that generate input 
control policy information that is provided as input 
to the method, 

whereby information representing or identifying 
actions referred to by the method or controlled by the 
method can be maintained and utilized internally 
within the method, 

whereby said method can be utilized as a fuzzy con- 
troller and cleanly encapsulated as a distinct infor- 
mation processing system. 
9. The method recited in claim 1, further including

(a.) detecting potential conflicts that may arise from use of 
output control policy to drive action selection policy, 
such that such conflicts can be determined by process- 
ing information available within said output control 
policy, 

(b.) resolving said potential conflicts by modification of 
the output control policy, whereby said method will 
detect conflicts that may arise from combining said 
plurality of input control policies recited in claim 1 and 
which are evident by examining the output control 
policy, 

whereby said method will modify said output control 
process to be free of said conflicts.

10. The method recited in claim 1, further including 
(a.) detecting potential conflicts that may arise from use of 

output control policy to drive action selection policy in 
the presence of ongoing actions detectable by the 
method, 

(b.) resolving said potential conflicts by modification of 
the output control policy, 

whereby said method will detect conflicts that may 
arise from combining said plurality of input control 
policies recited in claim 1 and which could trigger 
conflicts with ongoing actions, 

whereby said method will modify said output control 
process to be free of said conflicts. 

11. The method recited in claim 1, further including 
(a.) detecting potential conflicts that may arise from use of 

output control policy to drive action selection policy 
given actions previously triggered by the method as a 
result of operation thereof over previous iterations, 
(b.) resolving said potential conflicts by modification of 
the output control policy, 

whereby said method will detect conflicts that may 
arise from combining said plurality of input control 
policies recited in claim 1 and which could trigger 
conflicts with actions previously triggered by the 
method or as a result of its operation, 

whereby said method will modify said output control 
process to be free of said conflicts. 

12. The method recited in claim 1, further including 
(a.) detecting potential conflicts that may arise from use of 

output control policy to drive action selection policy in 
the presence of ongoing actions detectable by the 
method, 

(b.) resolving said potential conflicts by aborting or modi- 
fying one or more ongoing actions, 
whereby said method will detect conflicts that may 

arise from combining said plurality of input control 

policies recited in claim 1 and which could trigger 

conflicts with ongoing actions, 
whereby said method will resolve said conflicts by 

modifying or aborting the offending ongoing actions. 

13. The method recited in claim 1, further including 
(a.) detecting potential conflicts that may arise from use of 

output control policy to drive action selection policy 

given actions previously triggered by the method as a 

result of operation thereof, 
(b.) resolving said potential conflicts by modifying or 

aborting previously triggered actions, 

whereby said method will detect conflicts that may 
arise from combining said plurality of input control 
policies recited in claim 1 and which could trigger 
conflicts with actions previously triggered by the 
method or as a result of its operation, 

whereby said method will resolve such conflicts by 
modifying or aborting offending previously triggered 
actions. 
14. The method recited in claim 5, further extending the
method to apply to a multitude of actions, 

(a.) providing an input information storage device capable 
of containing a plurality of descriptions of policy 
information for a multitude of actions, each said 
description of policy information contained in said 
plurality of descriptions of policy information repre-
sented as a compact statistic,

(b.) storing said plurality of input control policies into said 
information storage device, 

(c.) retrieving said compact statistics from said informa- 
tion storage device and transmitting to said method, 

(d.) combining said plurality of compact statistics such 
that the plurality of input control policies represented as 
compact statistics can be combined, 
whereby said method can be applied to large action 
sets. 

15. The method recited in claim 6, further extending the 
method to apply to a multitude of actions, 

(a.) providing an input information storage device capable
of containing a plurality of descriptions of policy 
information over a multitude of actions, each said 
description of policy information contained in said 
plurality of descriptions of policy information repre- 
sented as a compact statistic, 

(b.) storing said input control policies into said informa- 
tion storage device, 

(c.) retrieving said compact statistics from said informa-
tion storage device and transmitting to said method,

(d.) combining said plurality of compact statistics such 
that the plurality of input control policies represented as 
compact statistics can be combined, 
whereby said method can be applied to large action 
sets. 

16. The method recited in claim 1 further extending the 
method to apply hierarchically to the results of a plurality of 
applications of the method, 

(a.) providing a plurality of server information processing 

devices capable of implementing said method, 
(b.) providing a master information processing device 

capable of implementing said method, 
(c.) interconnecting said server information processing 
devices to said master information processing device, 
whereby said method can be applied by using as input 
the plurality of outputs as computed by distinct 
instances of said method in a hierarchical manner, 
whereby said method can be applied in a hierarchical 

fashion using a plurality of hierarchical levels, 
whereby said method can apply different embodiments 
of the invention at different levels within the hierar- 
chy. 

17. The method recited in claim 1 further extending the 
method to apply recursively to the results of a plurality of 
applications of the method, 

(a.) providing an information processing device capable 
of implementing said method, 

(b.) providing an information storage device capable of 
storing output results of said information processing 
device as applied to computing said method, 

(c.) storing said output results from information process- 
ing device into said information storage device, 

(d.) retrieving said output results from said information 
storage device into said information processing device, 

(e.) applying information storage device to the task of
computing a plurality of implementations of the 
method, 



whereby said method can be applied recursively to 

compute a plurality of instances of said method, 
whereby said plurality of instances of said method can 
be computed and stored for subsequent use as input 
to a subsequent instance of said method.

18. An article of manufacture comprising a data storage 
medium tangibly embodying a program of machine- 
readable instructions executable by a digital processing 
apparatus to perform method steps for combining policy
information, the method steps comprising:

(a.) Identifying v sub-policies at time t: {rf"*^}, i=l, 
2, . . . ,v, where is the policy function for the i rt 
sub-policy at time t, for t-0, 1, 2, ... , 
(b.) Specifying a set of actions A where the number of 
actions in A is denoted by #A and actions in A are 
denoted by aeA, and specifying a stimulus s,«S, 
(c.) Specifying a set of permissible mixture distributions
over v-dimensional policy space: E ⊂ I^v, where I is the
real-valued interval from 0 to 1 inclusive (i.e., includ-
ing 0 and 1 as endpoints), and I^v is the v-dimensional
space obtained by taking cross-products of I,
(d.) Specifying a v -dimensional "recursive mixing func- 
tion" g" 1 : SxE— E, such that for the stimulus s^eS, and 
mixing value hcE, g^s^^E, where each dimension of 
said gfX-,-) is denoted by g^X-.'X i-l^.- * • > v > 
(e.) Specifying a value of the recursive mixing function at 
the previous time step t-1 represented by the recursive 
function K^i^iit-d* su<± that the recursion is 
finite such that for t-O, h,=ho and ho is defined to take 
a value in E, 

(f.) Computing a functional composition of the v sub- 
policies {it™'',}' . . . ,v, and the v-dimcnsional 
recursive mixing function h,), given the stimulus 
s, and the previous mixing value h,, 
whereby the plurality of v sub-policies can be subse- 
quently combined according to said mixture distri- 
butions, 

19. The article of claim 18 further specifying: 
(a.) Specifying a nonrecursive mixing function g^: S-*E, 

such that for the stimulus s,eS, g^(s,)tE, where each 
dimension of said v-dimensional £(.,-) is denoted by 
*/<-,-), i-1 A • • ,v, 
(a.) Computing a linear weighted sum of the v sub- 
policies: 



whereby said plurality of v sub-policies can be com- 
bined using said linear weighted sum. 
20. The article of claim 18, further specifying a program 
of machine-readable instructions executable by a digital 
processing apparatus to perform method steps for combining 
policy information, the method steps comprising: 
(a.) Specifying a recursive mixing function g"*: SxE-*E , 
such that for stimulus s,eS, and mixing value reE, 
g^eE, 

(b.) Specifying a nonrecursive mixing function g: S-*E, 
such that for stimulus s^S, g(s^eE, 

(c.) Specifying a value of the mixing function at the 
previous time step t-1 represented by the recursive 
function K^i^Mt-df such that the recursion is 
finite such that ho is defined to take a value in E, 

(d.) Specifying at time t a scalar value xel and a scalar 
value y=l-x, 
(e.) Specifying a function q,: E'-
h^i hi) eE. 



•E such that q/h^ h t . 




(f.) Computing a recursive update function such that
given stimulus s_t∈S at time t,

whereby the mixing function can smoothly transition 
over time, 

whereby the mixing function can exhibit a dependency 
upon previous values of the mixing function, 

whereby the mixing function can allow (but does not
require) the effect of temporal persistence upon the 
mixing function depending upon selection of param- 
eter x and function q, for given t. 

21. The medium recited in claim 20 further specifying 
(a.) Computing a moving average update of the mixing
function such that given stimulus s_t∈S at time t,

whereby the mixing function can smoothly transition 
over time, 

whereby the mixing function can have a moving aver- 
age dependency upon previous values of the mixing 
function, 

whereby the mixing function can allow (but does not
require) the effect of temporal persistence upon the
mixing function depending upon selection of param-
eter x.
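
The moving average update of claims 20 and 21 can be sketched as follows. The update equation is missing from this copy, so the form h_t = x·h_{t-1} + (1-x)·g(s_t) is an assumption consistent with the scalar values x and y=1-x recited in claim 20; the function name update_mixing and the example values are likewise hypothetical.

```python
def update_mixing(h_prev, g_now, x):
    """One assumed moving-average step, applied per dimension:
    h_t = x * h_prev + (1 - x) * g_now.

    h_prev: previous mixing value h_{t-1} (list of v numbers)
    g_now:  nonrecursive mixing value g(s_t) for the current stimulus
    x:      persistence parameter, assumed to lie in the interval [0, 1]
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x is assumed to lie in [0, 1]")
    y = 1.0 - x
    return [x * hp + y * g for hp, g in zip(h_prev, g_now)]


# Start fully on sub-policy 1 (h_0), while the stimulus now favors sub-policy 2.
h = [1.0, 0.0]
target = [0.0, 1.0]
history = [h]
for _ in range(3):
    h = update_mixing(h, target, x=0.5)
    history.append(h)
```

In this sketch, x=0 removes temporal persistence (the mixing value follows g(s_t) immediately) and x=1 freezes the previous mixing value, while intermediate values hand control over gradually.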

22. The article of claim 18 wherein 

computing the functional composition of the v sub-
policies {π_i}, i=1, . . . ,v, and the v-dimensional
recursive mixing function g(s_t, h_{t-1}), given the stimulus
s_t and the previous mixing value h_{t-1}, according to the
following:

whereby the plurality of v sub-policies can be com-
bined via a possibly nonlinear functional composi-
tion and allowing although not requiring the effect of
temporal persistence depending upon previous val-
ues of the mixing function.

23. The article of claim 22 further specifying 
(a.) A linear weighted sum of the v sub-policies: 

whereby said plurality of v sub-policies can be com-
bined using said linear weighted sum and incorpo-
rating the effect of recursive temporal persistence
allowing the current mixing value to depend upon
previous values of the mixing function.
24. A method of combining a plurality of input value 
functions, comprising: 
(a.) providing an input information transmitting device
representing an input control stimulus,
(b.) providing an input information transmitting device
representing a plurality of input value functions,
(c.) providing an output information transmitting device
representing an output value function,
(d.) combining said input value functions into said output 
value function, such that more than one said input value 
function may simultaneously influence said output 
value function for said input control stimulus, 
(e.) transmitting said output value function via said output 
information transmitting device, 
whereby said method will combine said input value 
function by functional composition to obtain a single 
output value function, 
whereby more than one said input value function are 
able to simultaneously influence the said output 
value function for the said input control stimulus, 
whereby said method can iterate over time, 
whereby said method will allow said output value 
function to smoothly transition control from being 
influenced substantially by one of said input value 
functions to being influenced by substantially 
another of said input value functions, 
whereby said method will allow said output value 
function to smoothly transition control from being 
influenced substantially by one particular functional 
composition of said input value functions to being 
influenced substantially by another functional com- 
position of said input value functions, 
whereby said method will allow the combination of a 
plurality of input value functions for the purpose of 
consolidating that information into a form suitable 
for use by a policy-based "action selection executive"
(i.e., a policy-based "controller") that is able to
convert a value function into a control policy, and 
then use that control policy to automate the execu- 
tion of action selection. 
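
Claim 24 can be illustrated by a minimal sketch in which several input value functions are combined into one output value function that a controller then converts into an action choice. The claim does not fix the functional composition or the conversion rule, so the weighted sum and the greedy selection used here are assumptions, as are the names combine_values and greedy_action and the example numbers.

```python
def combine_values(value_functions, weights):
    """Combine input value functions into a single output value function.

    value_functions[i][a] is the value that input function i assigns to
    action a for the current control stimulus. A weighted sum is assumed
    here as one possible functional composition.
    """
    n_actions = len(value_functions[0])
    return [
        sum(w * vf[a] for vf, w in zip(value_functions, weights))
        for a in range(n_actions)
    ]


def greedy_action(values):
    """Assumed conversion of a value function into a control choice:
    pick the index of the highest-valued action."""
    return max(range(len(values)), key=values.__getitem__)


# Two hypothetical input value functions over three actions.
q_1 = [1.0, 0.0, 0.5]
q_2 = [0.0, 1.0, 0.5]

early = combine_values([q_1, q_2], [0.75, 0.25])  # mostly influenced by q_1
late = combine_values([q_1, q_2], [0.25, 0.75])   # mostly influenced by q_2
```

Both input value functions influence the output at once, and shifting the weights over time moves the selected action from the one preferred by q_1 to the one preferred by q_2.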

***** 